| 
									
										
										
										
											2005-06-23 00:07:49 -07:00
										 |  |  | config SELECT_MEMORY_MODEL | 
					
						
							|  |  |  | 	def_bool y | 
					
						
							|  |  |  | 	depends on EXPERIMENTAL || ARCH_SELECT_MEMORY_MODEL | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-06-23 00:07:42 -07:00
										 |  |  | choice | 
					
						
							|  |  |  | 	prompt "Memory model" | 
					
						
							| 
									
										
										
										
											2005-06-23 00:07:49 -07:00
										 |  |  | 	depends on SELECT_MEMORY_MODEL | 
					
						
							|  |  |  | 	default DISCONTIGMEM_MANUAL if ARCH_DISCONTIGMEM_DEFAULT | 
					
						
							| 
									
										
											  
											
												[PATCH] sparsemem memory model
Sparsemem abstracts the use of discontiguous mem_maps[].  This kind of
mem_map[] is needed by discontiguous memory machines (like in the old
CONFIG_DISCONTIGMEM case) as well as memory hotplug systems.  Sparsemem
replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
become a complete replacement.
A significant advantage over DISCONTIGMEM is that it's completely separated
from CONFIG_NUMA.  When producing this patch, it became apparent that NUMA
and DISCONTIG are often confused.
Another advantage is that sparse doesn't require each NUMA node's ranges to be
contiguous.  It can handle overlapping ranges between nodes with no problems,
where DISCONTIGMEM currently throws away that memory.
Sparsemem uses an array to provide different pfn_to_page() translations for
each SECTION_SIZE area of physical memory.  This is what allows the mem_map[]
to be chopped up.
In order to do quick pfn_to_page() operations, the section number of the page
is encoded in page->flags.  Part of the sparsemem infrastructure enables
sharing of these bits more dynamically (at compile-time) between the
page_zone() and sparsemem operations.  However, on 32-bit architectures, the
number of bits is quite limited, and may require growing the size of the
page->flags type in certain conditions.  Several things might force this to
occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
memory), an increase in the physical address space, or an increase in the
number of used page->flags.
One thing to note is that, once sparsemem is present, the NUMA node
information no longer needs to be stored in the page->flags.  It might provide
speed increases on certain platforms and will be stored there if there is
room.  But, if out of room, an alternate (theoretically slower) mechanism is
used.
This patch introduces CONFIG_FLATMEM.  It is used in almost all cases where
there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
often have to compile out the same areas of code.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
											
										 
											2005-06-23 00:07:54 -07:00
										 |  |  | 	default SPARSEMEM_MANUAL if ARCH_SPARSEMEM_DEFAULT | 
					
						
							| 
									
										
										
										
											2005-06-23 00:07:49 -07:00
										 |  |  | 	default FLATMEM_MANUAL | 
					
						
							| 
									
										
										
										
											2005-06-23 00:07:42 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-06-23 00:07:49 -07:00
										 |  |  | config FLATMEM_MANUAL | 
					
						
							| 
									
										
										
										
											2005-06-23 00:07:42 -07:00
										 |  |  | 	bool "Flat Memory" | 
					
						
							| 
									
										
										
										
											2006-01-06 00:12:07 -08:00
										 |  |  | 	depends on !(ARCH_DISCONTIGMEM_ENABLE || ARCH_SPARSEMEM_ENABLE) || ARCH_FLATMEM_ENABLE | 
					
						
							| 
									
										
										
										
											2005-06-23 00:07:42 -07:00
										 |  |  | 	help | 
					
						
							|  |  |  | 	  This option allows you to change some of the ways that | 
					
						
							|  |  |  | 	  Linux manages its memory internally.  Most users will | 
					
						
							|  |  |  | 	  only have one option here: FLATMEM.  This is normal | 
					
						
							|  |  |  | 	  and a correct option. | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
											
										 
											2005-06-23 00:07:54 -07:00
										 |  |  | 	  Some users of more advanced features like NUMA and | 
					
						
							|  |  |  | 	  memory hotplug may have different options here. | 
					
						
							|  |  |  | 	  DISCONTIGMEM is a more mature, better tested system, | 
					
						
							|  |  |  | 	  but is incompatible with memory hotplug and may suffer | 
					
						
							|  |  |  | 	  decreased performance compared with SPARSEMEM.  If unsure between | 
					
						
							|  |  |  | 	  "Sparse Memory" and "Discontiguous Memory", choose | 
					
						
							|  |  |  | 	  "Discontiguous Memory". | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	  If unsure, choose this option (Flat Memory) over any other. | 
					
						
							| 
									
										
										
										
											2005-06-23 00:07:42 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-06-23 00:07:49 -07:00
										 |  |  | config DISCONTIGMEM_MANUAL | 
					
						
							| 
									
										
										
										
											2005-09-16 19:27:54 -07:00
										 |  |  | 	bool "Discontiguous Memory" | 
					
						
							| 
									
										
										
										
											2005-06-23 00:07:42 -07:00
										 |  |  | 	depends on ARCH_DISCONTIGMEM_ENABLE | 
					
						
							|  |  |  | 	help | 
					
						
							| 
									
										
										
										
											2005-06-23 00:07:50 -07:00
										 |  |  | 	  This option provides enhanced support for discontiguous | 
					
						
							|  |  |  | 	  memory systems, over FLATMEM.  These systems have holes | 
					
						
							|  |  |  | 	  in their physical address spaces, and this option provides | 
					
						
							|  |  |  | 	  more efficient handling of these holes.  However, the vast | 
					
						
							|  |  |  | 	  majority of hardware has quite flat address spaces, and | 
					
						
							| 
									
										
										
										
											2007-10-20 02:46:58 +02:00
										 |  |  | 	  can have degraded performance from the extra overhead that | 
					
						
							| 
									
										
										
										
											2005-06-23 00:07:50 -07:00
										 |  |  | 	  this option imposes. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	  Many NUMA configurations will have this as the only option. | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-06-23 00:07:42 -07:00
										 |  |  | 	  If unsure, choose "Flat Memory" over this option. | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
											
										 
											2005-06-23 00:07:54 -07:00
										 |  |  | config SPARSEMEM_MANUAL | 
					
						
							|  |  |  | 	bool "Sparse Memory" | 
					
						
							|  |  |  | 	depends on ARCH_SPARSEMEM_ENABLE | 
					
						
							|  |  |  | 	help | 
					
						
							|  |  |  | 	  This will be the only option for some systems, including | 
					
						
							|  |  |  | 	  memory hotplug systems.  This is normal. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	  For many other systems, this will be an alternative to | 
					
						
							| 
									
										
										
										
											2005-09-16 19:27:54 -07:00
										 |  |  | 	  "Discontiguous Memory".  This option provides some potential | 
					
						
							| 
									
										
											  
											
											
										 
											2005-06-23 00:07:54 -07:00
										 |  |  | 	  performance benefits, along with decreased code complexity, | 
					
						
							|  |  |  | 	  but it is newer, and more experimental. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	  If unsure, choose "Discontiguous Memory" or "Flat Memory" | 
					
						
							|  |  |  | 	  over this option. | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-06-23 00:07:42 -07:00
										 |  |  | endchoice | 
					
						
							|  |  |  | 
 | 
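The choice above is driven by ARCH_* symbols that this file tests but does not define. A minimal sketch of how an architecture might supply them, assuming a hypothetical arch/myarch/Kconfig and an invented MYARCH_BIGMEM symbol (neither comes from this file):

```
# Hypothetical arch/myarch/Kconfig fragment.
# MYARCH_BIGMEM is an invented symbol used only for illustration.

# Make "Sparse Memory" available in the "Memory model" choice.
config ARCH_SPARSEMEM_ENABLE
	def_bool y
	depends on MYARCH_BIGMEM

# Preselect SPARSEMEM_MANUAL instead of falling through to FLATMEM_MANUAL.
config ARCH_SPARSEMEM_DEFAULT
	def_bool y
	depends on ARCH_SPARSEMEM_ENABLE

# Show the choice even when EXPERIMENTAL is not set.
config ARCH_SELECT_MEMORY_MODEL
	def_bool y
	depends on ARCH_SPARSEMEM_ENABLE
```

Under these assumptions, and with ARCH_FLATMEM_ENABLE left undefined, the FLATMEM_MANUAL dependency evaluates to false, so "Flat Memory" would no longer be offered on such a configuration.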
					
						
							| 
									
										
										
										
											2005-06-23 00:07:49 -07:00
										 |  |  | config DISCONTIGMEM | 
					
						
							|  |  |  | 	def_bool y | 
					
						
							|  |  |  | 	depends on (!SELECT_MEMORY_MODEL && ARCH_DISCONTIGMEM_ENABLE) || DISCONTIGMEM_MANUAL | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
											
										 
											2005-06-23 00:07:54 -07:00
										 |  |  | config SPARSEMEM | 
					
						
							|  |  |  | 	def_bool y | 
					
						
							|  |  |  | 	depends on SPARSEMEM_MANUAL | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-06-23 00:07:49 -07:00
										 |  |  | config FLATMEM | 
					
						
							|  |  |  | 	def_bool y | 
					
						
							| 
									
										
											  
											
											
										 
											2005-06-23 00:07:54 -07:00
										 |  |  | 	depends on (!DISCONTIGMEM && !SPARSEMEM) || FLATMEM_MANUAL | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | config FLAT_NODE_MEM_MAP | 
					
						
							|  |  |  | 	def_bool y | 
					
						
							|  |  |  | 	depends on !SPARSEMEM | 
					
						
							| 
									
										
										
										
											2005-06-23 00:07:49 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-06-23 00:07:47 -07:00
										 |  |  | # | 
					
						
							|  |  |  | # Both the NUMA code and DISCONTIGMEM use arrays of pg_data_t's | 
					
						
							|  |  |  | # to represent different areas of memory.  This variable allows | 
					
						
							|  |  |  | # those dependencies to exist individually. | 
					
						
							|  |  |  | # | 
					
						
							|  |  |  | config NEED_MULTIPLE_NODES | 
					
						
							|  |  |  | 	def_bool y | 
					
						
							|  |  |  | 	depends on DISCONTIGMEM || NUMA | 
					
						
							| 
									
										
										
										
											2005-06-23 00:07:53 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | config HAVE_MEMORY_PRESENT | 
					
						
							|  |  |  | 	def_bool y | 
					
						
							| 
									
										
											  
											
											
										 
											2005-06-23 00:07:54 -07:00
										 |  |  | 	depends on ARCH_HAVE_MEMORY_PRESENT || SPARSEMEM | 
					
						
							| 
									
										
										
										
											2005-09-03 15:54:26 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-09-03 15:54:28 -07:00
										 |  |  | # | 
					
						
							|  |  |  | # SPARSEMEM_EXTREME (which is the default) does some bootmem | 
					
						
							| 
									
										
										
										
											2006-10-03 22:53:09 +02:00
										 |  |  | # allocations when memory_present() is called.  If this cannot | 
					
						
							| 
									
										
										
										
											2005-09-03 15:54:28 -07:00
										 |  |  | # be done on your architecture, select this option.  However, | 
					
						
							|  |  |  | # statically allocating the mem_section[] array can potentially | 
					
						
							|  |  |  | # consume vast quantities of .bss, so be careful. | 
					
						
							|  |  |  | # | 
					
						
							|  |  |  | # This option will also potentially produce smaller runtime code | 
					
						
							|  |  |  | # with gcc 3.4 and later. | 
					
						
							|  |  |  | # | 
					
						
							|  |  |  | config SPARSEMEM_STATIC | 
					
						
							| 
									
										
										
										
											2008-10-15 22:01:38 -07:00
										 |  |  | 	bool | 
					
						
							| 
									
										
										
										
											2005-09-03 15:54:28 -07:00
										 |  |  | 
 | 
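SPARSEMEM_STATIC has no prompt, so it can only be turned on by a `select` from architecture code. A sketch of that opt-in, assuming a hypothetical architecture Kconfig; the condition symbol MYARCH_EARLY_MEMORY_PRESENT is invented for illustration:

```
# Hypothetical arch/myarch/Kconfig fragment.
config ARCH_SPARSEMEM_ENABLE
	def_bool y
	# Assumption: on this platform memory_present() runs too early
	# for bootmem allocations, so fall back to the statically
	# allocated mem_section[] array.
	select SPARSEMEM_STATIC if MYARCH_EARLY_MEMORY_PRESENT
```

Because SPARSEMEM_EXTREME depends on !SPARSEMEM_STATIC, selecting this symbol also switches SPARSEMEM_EXTREME off. The cost is the .bss growth the comment above warns about.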
					
						
							| 
									
										
										
										
											2005-09-03 15:54:26 -07:00
										 |  |  | # | 
					
						
							| 
									
										
										
										
											2006-10-03 22:34:14 +02:00
										 |  |  | # Architecture platforms which require a two level mem_section in SPARSEMEM | 
					
						
							| 
									
										
										
										
											2005-09-03 15:54:26 -07:00
										 |  |  | # must select this option. This is usually for architecture platforms with | 
					
						
							|  |  |  | # an extremely sparse physical address space. | 
					
						
							|  |  |  | # | 
					
						
							| 
									
										
										
										
											2005-09-03 15:54:28 -07:00
										 |  |  | config SPARSEMEM_EXTREME | 
					
						
							|  |  |  | 	def_bool y | 
					
						
							|  |  |  | 	depends on SPARSEMEM && !SPARSEMEM_STATIC | 
					
						
							| 
									
										
											  
											
												[PATCH] mm: split page table lock
Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
a many-threaded application which concurrently initializes different parts of
a large anonymous area.
This patch corrects that, by using a separate spinlock per page table page, to
guard the page table entries in that page, instead of using the mm's single
page_table_lock.  (But even then, page_table_lock is still used to guard page
table allocation, and anon_vma allocation.)
In this implementation, the spinlock is tucked inside the struct page of the
page table page: with a BUILD_BUG_ON in case it overflows - which it would in
the case of 32-bit PA-RISC with spinlock debugging enabled.
Splitting the lock is not quite for free: another cacheline access.  Ideally,
I suppose we would use split ptlock only for multi-threaded processes on
multi-cpu machines; but deciding that dynamically would have its own costs.
So for now enable it by config, at some number of cpus - since the Kconfig
language doesn't support inequalities, let preprocessor compare that with
NR_CPUS.  But I don't think it's worth being user-configurable: for good
testing of both split and unsplit configs, split now at 4 cpus, and perhaps
change that to 8 later.
There is a benefit even for singly threaded processes: kswapd can be attacking
one part of the mm while another part is busy faulting.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
											
										 
											2005-10-29 18:16:40 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-10-16 01:24:14 -07:00
										 |  |  | config SPARSEMEM_VMEMMAP_ENABLE | 
					
						
							| 
									
										
										
										
											2008-10-15 22:01:38 -07:00
										 |  |  | 	bool | 
					
						
							| 
									
										
										
										
											2007-10-16 01:24:14 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | config SPARSEMEM_VMEMMAP | 
					
						
							| 
									
										
										
										
											2007-12-17 16:19:53 -08:00
										 |  |  | 	bool "Sparse Memory virtual memmap" | 
					
						
							|  |  |  | 	depends on SPARSEMEM && SPARSEMEM_VMEMMAP_ENABLE | 
					
						
							|  |  |  | 	default y | 
					
						
							|  |  |  | 	help | 
					
						
							|  |  |  | 	 SPARSEMEM_VMEMMAP uses a virtually mapped memmap to optimise | 
					
						
							|  |  |  | 	 pfn_to_page and page_to_pfn operations.  This is the most | 
					
						
							|  |  |  | 	 efficient option when sufficient kernel resources are available. | 
					
						
							| 
									
										
										
										
											2007-10-16 01:24:14 -07:00
										 |  |  | 
 | 
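SPARSEMEM_VMEMMAP_ENABLE is likewise promptless, so the "Sparse Memory virtual memmap" option only appears once an architecture selects it. A minimal sketch, assuming a hypothetical architecture Kconfig that restricts the feature to 64-bit builds (the use of 64BIT here is an assumption, not something this file requires):

```
# Hypothetical arch/myarch/Kconfig fragment.
config ARCH_SPARSEMEM_ENABLE
	def_bool y
	# Assumption: only 64-bit builds have enough kernel virtual
	# address space to map the whole memmap virtually.
	select SPARSEMEM_VMEMMAP_ENABLE if 64BIT
```

With SPARSEMEM also enabled, SPARSEMEM_VMEMMAP then defaults to y but stays user-visible, so it can still be switched off.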
					
						
							| 
									
										
										
										
											2005-10-29 18:16:54 -07:00
										 |  |  | # eventually, we can have this option just 'select SPARSEMEM' | 
					
						
							|  |  |  | config MEMORY_HOTPLUG | 
					
						
							|  |  |  | 	bool "Allow for memory hot-add" | 
					
						
							| 
									
										
										
										
											2006-09-30 23:27:05 -07:00
										 |  |  | 	depends on SPARSEMEM || X86_64_ACPI_NUMA | 
					
						
							| 
									
										
										
										
											2009-06-16 10:30:48 +02:00
										 |  |  | 	depends on HOTPLUG && !(HIBERNATION && !S390) && ARCH_ENABLE_MEMORY_HOTPLUG | 
					
						
							| 
									
										
										
										
											2008-07-14 09:59:18 +02:00
										 |  |  | 	depends on (IA64 || X86 || PPC64 || SUPERH || S390) | 
					
						
							| 
									
										
										
										
											2005-10-29 18:16:54 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | comment "Memory hotplug is currently incompatible with Software Suspend" | 
					
						
							| 
									
										
										
										
											2009-06-16 10:30:48 +02:00
										 |  |  | 	depends on SPARSEMEM && HOTPLUG && HIBERNATION && !S390 | 
					
						
							| 
									
										
										
										
											2005-10-29 18:16:54 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-09-30 23:27:05 -07:00
										 |  |  | config MEMORY_HOTPLUG_SPARSE | 
					
						
							|  |  |  | 	def_bool y | 
					
						
							|  |  |  | 	depends on SPARSEMEM && MEMORY_HOTPLUG | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-10-16 01:26:12 -07:00
										 |  |  | config MEMORY_HOTREMOVE | 
					
						
							|  |  |  | 	bool "Allow for memory hot remove" | 
					
						
							|  |  |  | 	depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE | 
					
						
							|  |  |  | 	depends on MIGRATION | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-04-28 02:12:55 -07:00
										 |  |  | # | 
					
						
							|  |  |  | # If we have space for more page flags then we can enable additional | 
					
						
							|  |  |  | # optimizations and functionality. | 
					
						
							|  |  |  | # | 
					
						
							|  |  |  | # Regular Sparsemem takes page flag bits for the sectionid if it does not | 
					
						
							|  |  |  | # use a virtual memmap. Disable extended page flags for 32 bit platforms | 
					
						
							|  |  |  | # that require the use of a sectionid in the page flags. | 
					
						
							|  |  |  | # | 
					
						
							|  |  |  | config PAGEFLAGS_EXTENDED | 
					
						
							|  |  |  | 	def_bool y | 
					
						
							|  |  |  | 	depends on 64BIT || SPARSEMEM_VMEMMAP || !NUMA || !SPARSEMEM | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												[PATCH] mm: split page table lock
Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
a many-threaded application which concurrently initializes different parts of
a large anonymous area.
This patch corrects that, by using a separate spinlock per page table page, to
guard the page table entries in that page, instead of using the mm's single
page_table_lock.  (But even then, page_table_lock is still used to guard page
table allocation, and anon_vma allocation.)
In this implementation, the spinlock is tucked inside the struct page of the
page table page: with a BUILD_BUG_ON in case it overflows - which it would in
the case of 32-bit PA-RISC with spinlock debugging enabled.
Splitting the lock is not quite for free: another cacheline access.  Ideally,
I suppose we would use split ptlock only for multi-threaded processes on
multi-cpu machines; but deciding that dynamically would have its own costs.
So for now enable it by config, at some number of cpus - since the Kconfig
language doesn't support inequalities, let preprocessor compare that with
NR_CPUS.  But I don't think it's worth being user-configurable: for good
testing of both split and unsplit configs, split now at 4 cpus, and perhaps
change that to 8 later.
There is a benefit even for singly threaded processes: kswapd can be attacking
one part of the mm while another part is busy faulting.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
											
										 
											2005-10-29 18:16:40 -07:00
										 |  |  | # Heavily threaded applications may benefit from splitting the mm-wide | 
					
						
							|  |  |  | # page_table_lock, so that faults on different parts of the user address | 
					
						
							|  |  |  | # space can be handled with less contention: split it at this NR_CPUS. | 
					
						
							|  |  |  | # Default to 4 for wider testing, though 8 might be more appropriate. | 
					
						
							|  |  |  | # ARM's adjust_pte (unused if VIPT) depends on mm-wide page_table_lock. | 
					
						
							| 
									
										
										
										
											2005-11-23 13:37:37 -08:00
										 |  |  | # PA-RISC 7xxx's spinlock_t would enlarge struct page from 32 to 44 bytes. | 
					
						
							| 
									
										
											  
											
											
										 
											2005-10-29 18:16:40 -07:00
										 |  |  | # | 
					
						
							|  |  |  | config SPLIT_PTLOCK_CPUS | 
					
						
							|  |  |  | 	int | 
					
						
							|  |  |  | 	default "4096" if ARM && !CPU_CACHE_VIPT | 
					
						
							| 
									
										
										
										
											2005-11-23 13:37:37 -08:00
										 |  |  | 	default "4096" if PARISC && !PA20 | 
					
						
							| 
									
										
											  
											
											
										 
											2005-10-29 18:16:40 -07:00
										 |  |  | 	default "4" | 
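
The commit message for this option notes that Kconfig cannot express an inequality, so the comparison against NR_CPUS is left to the preprocessor. A minimal sketch of that idea follows. The two values are assumptions chosen for illustration (a real build takes them from the generated configuration), and the `MODEL_`/`model_` names are hypothetical.

```c
#include <assert.h>

/* Assumed values for illustration only; a real build gets these from
 * the generated configuration rather than hand-written defines. */
#ifndef NR_CPUS
#define NR_CPUS 8
#endif
#ifndef CONFIG_SPLIT_PTLOCK_CPUS
#define CONFIG_SPLIT_PTLOCK_CPUS 4
#endif

/* The inequality Kconfig cannot express is evaluated at compile time. */
#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
#define MODEL_USE_SPLIT_PTLOCKS 1
#else
#define MODEL_USE_SPLIT_PTLOCKS 0
#endif

/* Hypothetical helper: 1 when per-page-table-page locks would be used. */
static int model_uses_split_ptlocks(void)
{
	return MODEL_USE_SPLIT_PTLOCKS;
}

/* The same comparison as a runtime function, for trying other values. */
static int model_would_split(unsigned int nr_cpus, unsigned int threshold)
{
	assert(threshold > 0);
	return nr_cpus >= threshold;
}
```

Read this way, the "4096" defaults above appear to act as a threshold that no realistic NR_CPUS reaches, so the lock stays unsplit on those configurations.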
					
						
							| 
									
										
										
										
											2006-01-08 01:00:49 -08:00
										 |  |  | 
 | 
					
						
							|  |  |  | # | 
					
						
							|  |  |  | # support for page migration | 
					
						
							|  |  |  | # | 
					
						
							|  |  |  | config MIGRATION | 
					
						
							| 
									
										
										
										
											2006-03-22 00:09:12 -08:00
										 |  |  | 	bool "Page migration" | 
					
						
							| 
									
										
										
										
											2006-06-23 02:03:37 -07:00
										 |  |  | 	def_bool y | 
					
						
							| 
									
										
										
										
											2008-07-23 21:28:22 -07:00
										 |  |  | 	depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE | 
					
						
							| 
									
										
										
										
											2006-03-22 00:09:12 -08:00
										 |  |  | 	help | 
					
						
							|  |  |  | 	  Allows the migration of the physical location of pages of processes | 
					
						
							|  |  |  | 	  while the virtual addresses are not changed. This is useful for | 
					
						
							|  |  |  | 	  example on NUMA systems to put pages nearer to the processors accessing | 
					
						
							|  |  |  | 	  the page. | 
					
						
							| 
									
										
										
										
											2006-06-12 17:11:31 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-09-11 01:31:45 -07:00
										 |  |  | config PHYS_ADDR_T_64BIT | 
					
						
							|  |  |  | 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-02-10 01:43:10 -08:00
										 |  |  | config ZONE_DMA_FLAG | 
					
						
							|  |  |  | 	int | 
					
						
							|  |  |  | 	default "0" if !ZONE_DMA | 
					
						
							|  |  |  | 	default "1" | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-07-17 04:03:37 -07:00
										 |  |  | config BOUNCE | 
					
						
							|  |  |  | 	def_bool y | 
					
						
							|  |  |  | 	depends on BLOCK && MMU && (ZONE_DMA || HIGHMEM) | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-05-06 14:49:50 -07:00
										 |  |  | config NR_QUICK | 
					
						
							|  |  |  | 	int | 
					
						
							|  |  |  | 	depends on QUICKLIST | 
					
						
							| 
									
										
										
										
											2008-01-14 23:35:32 +01:00
										 |  |  | 	default "2" if SUPERH || AVR32 | 
					
						
							| 
									
										
										
										
											2007-05-06 14:49:50 -07:00
										 |  |  | 	default "1" | 
					
						
							| 
									
										
										
										
											2007-07-15 23:40:05 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | config VIRT_TO_BUS | 
					
						
							|  |  |  | 	def_bool y | 
					
						
							|  |  |  | 	depends on !ARCH_NO_VIRT_TO_BUS | 
					
						
							| 
									
										
											  
											
												mmu-notifiers: core
With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to pages.
There are secondary MMUs (with secondary sptes and secondary tlbs) too.
sptes in the kvm case are shadow pagetables, but when I say spte in
mmu-notifier context, I mean "secondary pte".  In GRU case there's no
actual secondary pte and there's only a secondary tlb because the GRU
secondary MMU has no knowledge about sptes and every secondary tlb miss
event in the MMU always generates a page fault that has to be resolved by
the CPU (this is not the case with KVM, where a secondary tlb miss will
walk sptes in hardware and it will refill the secondary tlb transparently
to software if the corresponding spte is present).  The same way
zap_page_range has to invalidate the pte before freeing the page, the spte
(and secondary tlb) must also be invalidated before any page is freed and
reused.
Currently we take a page_count pin on every page mapped by sptes, but that
means the pages can't be swapped whenever they're mapped by any spte
because they're part of the guest working set.  Furthermore a spte unmap
event can immediately lead to a page being freed when the pin is released
(so requiring the same complex and relatively slow tlb_gather smp safe
logic we have in zap_page_range and that can be avoided completely if the
spte unmap event doesn't require an unpin of the page previously mapped in
the secondary MMU).
The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
when the VM is swapping or freeing or doing anything on the primary MMU so
that the secondary MMU code can drop sptes before the pages are freed,
avoiding all page pinning and allowing 100% reliable swapping of guest
physical address space.  Furthermore it spares the code that tears down the
mappings of the secondary MMU from implementing logic like tlb_gather in
zap_page_range that would require many IPI to flush other cpu tlbs, for
each fixed number of spte unmapped.
To make an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings will be
invalidated, and the next secondary-mmu-page-fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establish an updated
spte or secondary-tlb-mapping on the copied page.  Or it will set up a
readonly spte or readonly tlb mapping if it's a guest-read, if it calls
get_user_pages with write=0.  This is just an example.
This allows mapping any page pointed to by any pte (and in turn visible in the
primary CPU MMU) into a secondary MMU (be it a pure tlb like GRU, or a
full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
with kvm), or a remote DMA in software like XPMEM (hence the need to
schedule in XPMEM code to send the invalidate to the remote node, while
there is no need to schedule in kvm/gru as it's an immediate event like
invalidating a primary-mmu pte).
At least for KVM without this patch it's impossible to swap guests
reliably.  And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.
Dependencies:
1) mm_take_all_locks() to register the mmu notifier when the whole VM
   isn't doing anything with "mm".  This allows mmu notifier users to keep
   track of whether the VM is in the middle of the invalidate_range_begin/end
   critical section with an atomic counter increased in range_begin and
   decreased in range_end.  No secondary MMU page fault is allowed to map
   any spte or secondary tlb reference, while the VM is in the middle of
   range_begin/end as any page returned by get_user_pages in that critical
   section could later immediately be freed without any further
   ->invalidate_page notification (invalidate_range_begin/end works on
   ranges and ->invalidate_page isn't called immediately before freeing
   the page).  To stop all page freeing and pagetable overwrites the
   mmap_sem must be taken in write mode and all other anon_vma/i_mmap
   locks must be taken too.
2) It'd be a waste to add branches in the VM if nobody could possibly
   run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled if
   CONFIG_KVM=m/y.  In the current kernel kvm won't yet take advantage of
   mmu notifiers, but this already allows compiling a KVM external module
   against a kernel with mmu notifiers enabled and from the next pull from
   kvm.git we'll start using them.  And GRU/XPMEM will also be able to
   continue the development by enabling KVM=m in their config, until they
   submit all GRU/XPMEM GPLv2 code to the mainline kernel.  Then they can
   also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
   This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
   are all =n.
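
The counter described in dependency 1 can be sketched as follows. This is a user-space model using C11 atomics, assuming a driver keeps one counter per mm; `struct model_notifier_state` and the `model_` functions are hypothetical names, and the sketch leaves out the locking a real driver would need to close the window between checking the counter and installing a mapping.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-mm state kept by a secondary-MMU driver. */
struct model_notifier_state {
	atomic_int invalidate_count; /* > 0 while inside range_begin/end */
};

/* Called when the primary MMU starts invalidating a range. */
static void model_range_begin(struct model_notifier_state *s)
{
	atomic_fetch_add(&s->invalidate_count, 1);
}

/* Called when the invalidate finishes; must pair with a begin. */
static void model_range_end(struct model_notifier_state *s)
{
	int prev = atomic_fetch_sub(&s->invalidate_count, 1);

	assert(prev > 0);
}

/* A secondary-MMU fault may only map a page when no invalidate is in flight. */
static bool model_may_map(struct model_notifier_state *s)
{
	return atomic_load(&s->invalidate_count) == 0;
}
```

A counter rather than a flag is used so that nested or overlapping begin/end pairs are handled: mapping is allowed again only once every begin has seen its end.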
The mmu_notifier_register call can fail because mm_take_all_locks may be
interrupted by a signal and return -EINTR.  Because mmu_notifier_reigster
is used when a driver startup, a failure can be gracefully handled.  Here
an example of the change applied to kvm to register the mmu notifiers.
Usually when a driver startups other allocations are required anyway and
-ENOMEM failure paths exists already.
 struct  kvm *kvm_arch_create_vm(void)
 {
        struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+       int err;
        if (!kvm)
                return ERR_PTR(-ENOMEM);
        INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+       kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+       err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+       if (err) {
+               kfree(kvm);
+               return ERR_PTR(err);
+       }
+
        return kvm;
 }
mmu_notifier_unregister returns void and it's reliable.
The patch also adds a few needed but missing includes that would prevent
the kernel from compiling after these changes on non-x86 archs (x86 didn't
need them by luck).
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix mm/filemap_xip.c build]
[akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Chris Wright <chrisw@redhat.com>
Cc: Marcelo Tosatti <marcelo@kvack.org>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Izik Eidus <izike@qumranet.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-07-28 15:46:29 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-03-31 15:23:26 -07:00
										 |  |  | config HAVE_MLOCK | 
					
						
							|  |  |  | 	bool | 
					
						
							|  |  |  | 	default y if MMU=y | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | config HAVE_MLOCKED_PAGE_BIT | 
					
						
							|  |  |  | 	bool | 
					
						
							| 
									
										
										
										
											2009-06-16 15:32:51 -07:00
										 |  |  | 	default y if HAVE_MLOCK=y | 
					
						
							| 
									
										
										
										
											2009-03-31 15:23:26 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
											  
											
											
										 
											2008-07-28 15:46:29 -07:00
										 |  |  | config MMU_NOTIFIER | 
					
						
							|  |  |  | 	bool | 
					
						
							| 
									
										
										
										
											2009-05-06 16:03:05 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-06-03 16:04:31 -04:00
										 |  |  | config DEFAULT_MMAP_MIN_ADDR | 
					
						
							|  |  |  |         int "Low address space to protect from user allocation" | 
					
						
							|  |  |  |         default 4096 | 
					
						
							|  |  |  |         help | 
					
						
							|  |  |  | 	  This is the portion of low virtual memory which should be protected | 
					
						
							|  |  |  | 	  from userspace allocation.  Keeping a user from writing to low pages | 
					
						
							|  |  |  | 	  can help reduce the impact of kernel NULL pointer bugs. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	  For most ia64, ppc64 and x86 users with lots of address space | 
					
						
							|  |  |  | 	  a value of 65536 is reasonable and should cause no problems. | 
					
						
							|  |  |  | 	  On arm and other archs it should not be higher than 32768. | 
					
						
							|  |  |  | 	  Programs which use vm86 functionality would either need additional | 
					
						
							|  |  |  | 	  permissions from either the LSM or the capabilities module or have | 
					
						
							|  |  |  | 	  this protection disabled. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	  This value can be changed after boot using the | 
					
						
							|  |  |  | 	  /proc/sys/vm/mmap_min_addr tunable. | 
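
The effect of the setting can be modelled as a simple lower bound on where userspace may place a mapping. This is a sketch only: `model_addr_allowed` and its `privileged` parameter are hypothetical, with `privileged` standing in for the additional LSM or capability permissions that the help text mentions.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical check: may userspace place a mapping at addr?
 * min_addr models the configured value (or the mmap_min_addr tunable);
 * privileged models a caller holding the extra permissions the help
 * text mentions for vm86 users.
 */
static bool model_addr_allowed(unsigned long addr, unsigned long min_addr,
			       bool privileged)
{
	if (privileged)
		return true;
	return addr >= min_addr;
}
```

With the default of 4096 and 4 KiB pages, only the first page is protected; a value of 65536 would protect the first sixteen such pages.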
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-05-06 16:03:05 -07:00
										 |  |  | config NOMMU_INITIAL_TRIM_EXCESS | 
					
						
							|  |  |  | 	int "Turn on mmap() excess space trimming before booting" | 
					
						
							|  |  |  | 	depends on !MMU | 
					
						
							|  |  |  | 	default 1 | 
					
						
							|  |  |  | 	help | 
					
						
							|  |  |  | 	  The NOMMU mmap() frequently needs to allocate large contiguous chunks | 
					
						
							|  |  |  | 	  of memory on which to store mappings, but it can only ask the system | 
					
						
							|  |  |  | 	  allocator for chunks in 2^N*PAGE_SIZE amounts - which is frequently | 
					
						
							|  |  |  | 	  more than it requires.  To deal with this, mmap() is able to trim off | 
					
						
							|  |  |  | 	  the excess and return it to the allocator. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	  If trimming is enabled, the excess is trimmed off and returned to the | 
					
						
							|  |  |  | 	  system allocator, which can cause extra fragmentation, particularly | 
					
						
							|  |  |  | 	  if there are a lot of transient processes. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	  If trimming is disabled, the excess is kept, but not used, which for | 
					
						
							|  |  |  | 	  long-term mappings means that the space is wasted. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	  Trimming can be dynamically controlled through a sysctl option | 
					
						
							|  |  |  | 	  (/proc/sys/vm/nr_trim_pages) which specifies the minimum number of | 
					
						
							|  |  |  | 	  excess pages there must be before trimming should occur, or zero if | 
					
						
							|  |  |  | 	  no trimming is to occur. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	  This option specifies the initial value of that sysctl.  The default | 
					
						
							|  |  |  | 	  of 1 says that all excess pages should be trimmed. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	  See Documentation/nommu-mmap.txt for more information. |
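
The arithmetic behind the help text can be sketched as follows: a request is rounded up to a power-of-two number of pages, the difference is the excess, and the excess is trimmed only when trimming is enabled and the excess reaches the configured minimum. The `model_` helpers are hypothetical and work in units of pages; they are not the kernel's implementation and do no overflow handling.

```c
#include <assert.h>
#include <stdbool.h>

/* Pages handed back by the allocator: the next power of two >= request. */
static unsigned long model_alloc_pages(unsigned long pages_needed)
{
	unsigned long n = 1;

	assert(pages_needed > 0);
	while (n < pages_needed)
		n <<= 1;
	return n;
}

/* Pages allocated beyond what the mapping actually needs. */
static unsigned long model_excess_pages(unsigned long pages_needed)
{
	return model_alloc_pages(pages_needed) - pages_needed;
}

/*
 * Trim only if trimming is enabled (non-zero setting) and the excess
 * reaches the configured minimum number of pages.
 */
static bool model_should_trim(unsigned long excess, unsigned long nr_trim_pages)
{
	return nr_trim_pages != 0 && excess >= nr_trim_pages;
}
```

For a 5-page request the model allocates 8 pages, leaving 3 in excess. With the default setting of 1 those pages are trimmed; with a setting of 0 they are kept, and with a setting of 4 the excess is too small to trigger trimming.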