summaryrefslogtreecommitdiff
path: root/include/linux/bootmem.h
AgeCommit message (Collapse)Author
2012-07-11mm: sparse: fix usemap allocation above node descriptor sectionYinghai Lu
After commit f5bf18fa22f8 ("bootmem/sparsemem: remove limit constraint in alloc_bootmem_section"), usemap allocations may easily be placed outside the optimal section that holds the node descriptor, even if there is space available in that section. This results in unnecessary hotplug dependencies that need to have the node unplugged before the section holding the usemap. The reason is that the bootmem allocator doesn't guarantee a linear search starting from the passed allocation goal but may start out at a much higher address absent an upper limit. Fix this by trying the allocation with the limit at the section end, then retry without if that fails. This keeps the fix from f5bf18fa22f8 of not panicking if the allocation does not fit in the section, but still makes sure to try to stay within the section at first. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> [3.3.x, 3.4.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: remove sparsemem allocation details from the bootmem allocatorJohannes Weiner
alloc_bootmem_section() derives allocation area constraints from the specified sparsemem section. This is a bit specific for a generic memory allocator like bootmem, though, so move it over to sparsemem. As __alloc_bootmem_node_nopanic() already retries failed allocations with relaxed area constraints, the fallback code in sparsemem.c can be removed and the code becomes a bit more compact overall. [akpm@linux-foundation.org: fix build] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-24mm: add a low limit to alloc_large_system_hashTim Bird
UDP stack needs a minimum hash size value for proper operation and also uses alloc_large_system_hash() for proper NUMA distribution of its hash tables and automatic sizing depending on available system memory. On some low memory situations, udp_table_init() must ignore the alloc_large_system_hash() result and reallocs a bigger memory area. As we cannot easily free old hash table, we leak it and kmemleak can issue a warning. This patch adds a low limit parameter to alloc_large_system_hash() to solve this problem. We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table allocation. Reported-by: Mark Asselstine <mark.asselstine@windriver.com> Reported-by: Tim Bird <tim.bird@am.sony.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-14memblock, x86: Make free_all_memory_core_early() explicitly free lowmem onlyTejun Heo
nomemblock is currently used only by x86 and on x86_32 free_all_memory_core_early() silently freed only the low mem because get_free_all_memory_range() in arch/x86/mm/memblock.c implicitly limited range to max_low_pfn. Rename free_all_memory_core_early() to free_low_memory_core_early() and make it call __get_free_all_memory_range() and limit the range to max_low_pfn explicitly. This makes things clearer and also is consistent with the bootmem behavior. This leaves get_free_all_memory_range() without any user. Kill it. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1310462166-31469-9-git-send-email-tj@kernel.org Cc: Yinghai Lu <yinghai@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-05-25memblock/nobootmem: allow alloc_bootmem() to take 0 as low limitYinghai Lu
The bootmem wrapper with memblock supports top-down now, so we do not need to set the low limit to __pa(MAX_DMA_ADDRESS). The logic should be: good to allocate above __pa(MAX_DMA_ADDRESS), but it is ok if we can not find memory above 16M on system that has a small amount of RAM. Signed-off-by: Yinghai LU <yinghai@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Olaf Hering <olaf@aepfle.de> Cc: Tejun Heo <tj@kernel.org> Cc: Lucas De Marchi <lucas.demarchi@profusion.mobi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-11mm: use alloc_bootmem_node_nopanic() on really needed pathYinghai Lu
Stefan found nobootmem does not work on his system that has only 8M of RAM. This causes an early panic: BIOS-provided physical RAM map: BIOS-88: 0000000000000000 - 000000000009f000 (usable) BIOS-88: 0000000000100000 - 0000000000840000 (usable) bootconsole [earlyser0] enabled Notice: NX (Execute Disable) protection missing in CPU or disabled in BIOS! DMI not present or invalid. last_pfn = 0x840 max_arch_pfn = 0x100000 init_memory_mapping: 0000000000000000-0000000000840000 8MB LOWMEM available. mapped low ram: 0 - 00840000 low ram: 0 - 00840000 Zone PFN ranges: DMA 0x00000001 -> 0x00001000 Normal empty Movable zone start PFN for each node early_node_map[2] active PFN ranges 0: 0x00000001 -> 0x0000009f 0: 0x00000100 -> 0x00000840 BUG: Int 6: CR2 (null) EDI c034663c ESI (null) EBP c0329f38 ESP c0329ef4 EBX c0346380 EDX 00000006 ECX ffffffff EAX fffffff4 err (null) EIP c0353191 CS c0320060 flg 00010082 Stack: (null) c030c533 000007cd (null) c030c533 00000001 (null) (null) 00000003 0000083f 00000018 00000002 00000002 c0329f6c c03534d6 (null) (null) 00000100 00000840 (null) c0329f64 00000001 00001000 (null) Pid: 0, comm: swapper Not tainted 2.6.36 #5 Call Trace: [<c02e3707>] ? 0xc02e3707 [<c035e6e5>] 0xc035e6e5 [<c0353191>] ? 0xc0353191 [<c03534d6>] 0xc03534d6 [<c034f1cd>] 0xc034f1cd [<c034a824>] 0xc034a824 [<c03513cb>] ? 0xc03513cb [<c0349432>] 0xc0349432 [<c0349066>] 0xc0349066 It turns out that we should ignore the low limit of 16M. Use alloc_bootmem_node_nopanic() in this case. [akpm@linux-foundation.org: less mess] Signed-off-by: Yinghai LU <yinghai@kernel.org> Reported-by: Stefan Hellermann <stefan@the2masters.de> Tested-by: Stefan Hellermann <stefan@the2masters.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@kernel.org> [2.6.34+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23crash_dump: export is_kdump_kernel to modules, consolidate elfcorehdr_addr, ↵Olaf Hering
setup_elfcorehdr and saved_max_pfn The Xen PV drivers in a crashed HVM guest can not connect to the dom0 backend drivers because both frontend and backend drivers are still in connected state. To run the connection reset function only in case of a crashdump, the is_kdump_kernel() function needs to be available for the PV driver modules. Consolidate elfcorehdr_addr, setup_elfcorehdr and saved_max_pfn into kernel/crash_dump.c Also export elfcorehdr_addr to make is_kdump_kernel() usable for modules. Leave 'elfcorehdr' as early_param(). This changes powerpc from __setup() to early_param(). It adds an address range check from x86 also on ia64 and powerpc. [akpm@linux-foundation.org: additional #includes] [akpm@linux-foundation.org: remove elfcorehdr_addr export] [akpm@linux-foundation.org: fix for Tejun's mm/nobootmem.c changes] Signed-off-by: Olaf Hering <olaf@aepfle.de> Cc: Russell King <rmk@arm.linux.org.uk> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-13bootmem: Add alloc_bootmem_align()Suresh Siddha
Add an alloc_bootmem_align() interface to allocate bootmem with specified alignment. This is necessary to be able to allocate the xsave area in a subsequent patch. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> LKML-Reference: <20101116212441.977574826@sbsiddha-MOBL3.sc.intel.com> Acked-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: <stable@kernel.org>
2010-02-12x86: Make 64 bit use early_res instead of bootmem before slabYinghai Lu
Finally we can use early_res to replace bootmem for x86_64 now. Still can use CONFIG_NO_BOOTMEM to enable it or not. -v2: fix 32bit compiling about MAX_DMA32_PFN -v3: folded bug fix from LKML message below Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4B747239.4070907@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-11-10bootmem: Add free_bootmem_late()FUJITA Tomonori
Add a new function for freeing bootmem after the bootmem allocator has been released and the unreserved pages given to the page allocator. This allows us to reserve bootmem and then release it if we later discover it was not needed. ( This new API will be used by the swiotlb code to recover a significant amount of RAM (64MB). ) Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: chrisw@sous-sol.org Cc: dwmw2@infradead.org Cc: joerg.roedel@amd.com Cc: muli@il.ibm.com Cc: hannes@cmpxchg.org Cc: tj@kernel.org Cc: akpm@linux-foundation.org Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <1257849980-22640-7-git-send-email-fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-22mm: also use alloc_large_system_hash() for the PID hash tableJan Beulich
This is being done by allowing boot time allocations to specify that they may want a sub-page sized amount of memory. Overall this seems more consistent with the other hash table allocations, and allows making two supposedly mm-only variables really mm-only (nr_{kernel,all}_pages). Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01mm: enable hashdist by default on 64bit NUMAAnton Blanchard
On PowerPC we allocate large boot time hashes on node 0. This leads to an imbalance in the free memory, for example on a 64GB box (4 x 16GB nodes): Free memory: Node 0: 97.03% Node 1: 98.54% Node 2: 98.42% Node 3: 98.53% If we switch to using vmalloc (like ia64 and x86-64) things are more balanced: Free memory: Node 0: 97.53% Node 1: 98.35% Node 2: 98.33% Node 3: 98.33% For many HPC applications we are limited by the free available memory on the smallest node, so even though the same amount of memory is used the better balancing helps. Since all 64bit NUMA capable architectures should have sufficient vmalloc space, it makes sense to enable it via CONFIG_64BIT. Signed-off-by: Anton Blanchard <anton@samba.org> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Ralf Baechle <ralf@linux-mips.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Richard Henderson <rth@twiddle.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-24bootmem: reorder interface functions and add a missing oneTejun Heo
Impact: cleanup and addition of missing interface wrapper The interface functions in bootmem.h was ordered in not so orderly manner. Reorder them such that * functions allocating the same area group together - ie. alloc_bootmem group and alloc_bootmem_low group. * functions w/o node parameter come before the ones w/ node parameter. * nopanic variants are immediately below their panicky counterparts. While at it, add alloc_bootmem_pages_node_nopanic() which was missing. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@saeurebad.de>
2009-02-24bootmem: clean up arch-specific bootmem wrappingTejun Heo
Impact: cleaner and consistent bootmem wrapping By setting CONFIG_HAVE_ARCH_BOOTMEM_NODE, archs can define arch-specific wrappers for bootmem allocation. However, this is done a bit strangely in that only the high level convenience macros can be changed while lower level, but still exported, interface functions can't be wrapped. This not only is messy but also leads to strange situation where alloc_bootmem() does what the arch wants it to do but the equivalent __alloc_bootmem() call doesn't although they should be able to be used interchangeably. This patch updates bootmem such that archs can override / wrap the backend function - alloc_bootmem_core() instead of the highlevel interface functions to allow simpler and consistent wrapping. Also, HAVE_ARCH_BOOTMEM_NODE is renamed to HAVE_ARCH_BOOTMEM. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@saeurebad.de>
2008-08-12page allocator: use no-panic variant of alloc_bootmem() in ↵Jan Beulich
alloc_large_system_hash() .. since a failed allocation is being (initially) handled gracefully, and panic()-ed upon failure explicitly in the function if retries with smaller sizes failed. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25bootmem: Move node allocation macros back to !HAVE_ARCH_BOOTMEM_NODEJohannes Weiner
These got unintentionally moved, put them back as x86 provides its own versions. Signed-off-by: Johannes Weiner <hannes@saeurebad.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24bootmem: replace node_boot_start in struct bootmem_dataJohannes Weiner
Almost all users of this field need a PFN instead of a physical address, so replace node_boot_start with node_min_pfn. [Lee.Schermerhorn@hp.com: fix spurious BUG_ON() in mark_bootmem()] Signed-off-by: Johannes Weiner <hannes@saeureba.de> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24bootmem: clean up alloc_bootmem_coreJohannes Weiner
alloc_bootmem_core has become quite nasty to read over time. This is a clean rewrite that keeps the semantics. bdata->last_pos has been dropped. bdata->last_success has been renamed to hint_idx and it is now an index relative to the node's range. Since further block searching might start at this index, it is now set to the end of a succeeded allocation rather than its beginning. bdata->last_offset has been renamed to last_end_off to be more clear that it represents the ending address of the last allocation relative to the node. [y-goto@jp.fujitsu.com: fix new alloc_bootmem_core()] Signed-off-by: Johannes Weiner <hannes@saeurebad.de> Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24bootmem: reorder code to match new bootmem structureJohannes Weiner
This only reorders functions so that further patches will be easier to read. No code changed. Signed-off-by: Johannes Weiner <hannes@saeurebad.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24mm: introduce non panic alloc_bootmemAndi Kleen
Straight forward variant of the existing __alloc_bootmem_node, only subsequent patch when allocating giant hugepages at boot -- don't want to panic if we can't allocate as many as the user asked for. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24mm: unexport __alloc_bootmem_core()Johannes Weiner
This function has no external callers, so unexport it. Also fix its naming inconsistency. Signed-off-by: Johannes Weiner <hannes@saeurebad.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24mm: move bootmem descriptors definition to a single placeJohannes Weiner
There are a lot of places that define either a single bootmem descriptor or an array of them. Use only one central array with MAX_NUMNODES items instead. Signed-off-by: Johannes Weiner <hannes@saeurebad.de> Acked-by: Ralf Baechle <ralf@linux-mips.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Richard Henderson <rth@twiddle.net> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Tony Luck <tony.luck@intel.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Kyle McMartin <kyle@parisc-linux.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: David S. Miller <davem@davemloft.net> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-08x86: clean up reserve_bootmem_generic() and port it to 32-bitYinghai Lu
1. add reserve_bootmem_generic for 32bit 2. change len to unsigned long 3. make early_res_to_bootmem to use it Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-21Add return value to reserve_bootmem_node()Bernhard Walle
This patch changes the function reserve_bootmem_node() from void to int, returning -ENOMEM if the allocation fails. This fixes a build problem on x86 with CONFIG_KEXEC=y and CONFIG_NEED_MULTIPLE_NODES=y Signed-off-by: Bernhard Walle <bwalle@suse.de> Reported-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28memory hotplug: make alloc_bootmem_section()Yasunori Goto
alloc_bootmem_section() can allocate specified section's area. This is used for usemap to keep same section with pgdat by later patch. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07Introduce flags for reserve_bootmem()Bernhard Walle
This patchset adds a flags variable to reserve_bootmem() and uses the BOOTMEM_EXCLUSIVE flag in crashkernel reservation code to detect collisions between crashkernel area and already used memory. This patch: Change the reserve_bootmem() function to accept a new flag BOOTMEM_EXCLUSIVE. If that flag is set, the function returns with -EBUSY if the memory already has been reserved in the past. This is to avoid conflicts. Because that code runs before SMP initialisation, there's no race condition inside reserve_bootmem_core(). [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix powerpc build] Signed-off-by: Bernhard Walle <bwalle@suse.de> Cc: <linux-arch@vger.kernel.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-29Revert "x86_64: allocate sparsemem memmap above 4G"Linus Torvalds
This reverts commit 2e1c49db4c640b35df13889b86b9d62215ade4b6. First off, testing in Fedora has shown it to cause boot failures, bisected down by Martin Ebourne, and reported by Dave Jobes. So the commit will likely be reverted in the 2.6.23 stable kernels. Secondly, in the 2.6.24 model, x86-64 has now grown support for SPARSEMEM_VMEMMAP, which disables the relevant code anyway, so while the bug is not visible any more, it's become invisible due to the code just being irrelevant and no longer enabled on the only architecture that this ever affected. Reported-by: Dave Jones <davej@redhat.com> Tested-by: Martin Ebourne <fedora@ebourne.me.uk> Cc: Zou Nan hai <nanhai.zou@intel.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-01x86_64: allocate sparsemem memmap above 4GZou Nan hai
On systems with huge amount of physical memory, VFS cache and memory memmap may eat all available system memory under 4G, then the system may fail to allocate swiotlb bounce buffer. There was a fix for this issue in arch/x86_64/mm/numa.c, but that fix dose not cover sparsemem model. This patch add fix to sparsemem model by first try to allocate memmap above 4G. Signed-off-by: Zou Nan hai <nanhai.zou@intel.com> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Andi Kleen <ak@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-02[PATCH] x86-64: Set HASHDIST_DEFAULT to 1 for x86_64 NUMARavikiran G Thirumalai
Enable system hashtable memory to be distributed among nodes on x86_64 NUMA Forcing the kernel to use node interleaved vmalloc instead of bootmem for the system hashtable memory (alloc_large_system_hash) reduces the memory imbalance on node 0 by around 40MB on a 8 node x86_64 NUMA box: Before the following patch, on bootup of a 8 node box: Node 0 MemTotal: 3407488 kB Node 0 MemFree: 3206296 kB Node 0 MemUsed: 201192 kB Node 0 Active: 7012 kB Node 0 Inactive: 512 kB Node 0 Dirty: 0 kB Node 0 Writeback: 0 kB Node 0 FilePages: 1912 kB Node 0 Mapped: 420 kB Node 0 AnonPages: 5612 kB Node 0 PageTables: 468 kB Node 0 NFS_Unstable: 0 kB Node 0 Bounce: 0 kB Node 0 Slab: 5408 kB Node 0 SReclaimable: 644 kB Node 0 SUnreclaim: 4764 kB After the patch (or using hashdist=1 on the kernel command line): Node 0 MemTotal: 3407488 kB Node 0 MemFree: 3247608 kB Node 0 MemUsed: 159880 kB Node 0 Active: 3012 kB Node 0 Inactive: 616 kB Node 0 Dirty: 0 kB Node 0 Writeback: 0 kB Node 0 FilePages: 2424 kB Node 0 Mapped: 380 kB Node 0 AnonPages: 1200 kB Node 0 PageTables: 396 kB Node 0 NFS_Unstable: 0 kB Node 0 Bounce: 0 kB Node 0 Slab: 6304 kB Node 0 SReclaimable: 1596 kB Node 0 SUnreclaim: 4708 kB I guess it is a good idea to keep HASHDIST_DEFAULT "on" for x86_64 NUMA since x86_64 has no dearth of vmalloc space? Or maybe enable hash distribution for all 64bit NUMA arches? The following patch does it only for x86_64. I ran a HPC MPI benchmark -- 'Ansys wingsolid', which takes up quite a bit of memory and uses up tlb entries. This was on a 4 way, 2 socket Tyan AMD box (non vsmp), with 8G total memory (4G pernode). The results with and without hash distribution are: 1. Vanilla - runtime of 1188.000s 2. With hashdist=1 runtime of 1154.000s Oprofile output for the duration of run is: 1. Vanilla: PU: AMD64 processors, speed 2411.16 MHz (estimated) Counted L1_AND_L2_DTLB_MISSES events (L1 and L2 DTLB misses) with a unit mask of 0x00 (No unit mask) count 500 samples % app name symbol name 163054 6.5513 libansys1.so MultiFront::decompose(int, int, Elemset *, int *, int, int, int) 162061 6.5114 libansys3.so blockSaxpy6L_fd 162042 6.5107 libansys3.so blockInnerProduct6L_fd 156286 6.2794 libansys3.so maxb33_ 87879 3.5309 libansys1.so elmatrixmultpcg_ 84857 3.4095 libansys4.so saxpy_pcg 58637 2.3560 libansys4.so .st4560 46612 1.8728 libansys4.so .st4282 43043 1.7294 vmlinux-t copy_user_generic_string 41326 1.6604 libansys3.so blockSaxpyBackSolve6L_fd 41288 1.6589 libansys3.so blockInnerProductBackSolve6L_fd 2. With hashdist=1 CPU: AMD64 processors, speed 2411.13 MHz (estimated) Counted L1_AND_L2_DTLB_MISSES events (L1 and L2 DTLB misses) with a unit mask of 0x00 (No unit mask) count 500 samples % app name symbol name 162993 6.9814 libansys1.so MultiFront::decompose(int, int, Elemset *, int *, int, int, int) 160799 6.8874 libansys3.so blockInnerProduct6L_fd 160459 6.8729 libansys3.so blockSaxpy6L_fd 156018 6.6826 libansys3.so maxb33_ 84700 3.6279 libansys4.so saxpy_pcg 83434 3.5737 libansys1.so elmatrixmultpcg_ 58074 2.4875 libansys4.so .st4560 46000 1.9703 libansys4.so .st4282 41166 1.7632 libansys3.so blockSaxpyBackSolve6L_fd 41033 1.7575 libansys3.so blockInnerProductBackSolve6L_fd 35762 1.5318 libansys1.so inner_product_sub 35591 1.5245 libansys1.so inner_product_sub2 28259 1.2104 libansys4.so addVectors Signed-off-by: Pravin B. Shelar <pravin.shelar@calsoftinc.com> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Signed-off-by: Andi Kleen <ak@suse.de> Acked-by: Christoph Lameter <clameter@engr.sgi.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-03-22[PATCH] FRV: fix unannotated variable declarationsDavid Howells
Fix unannotated variable declarations. Variables that have allocation section annotations (such as __meminitdata) on their definitions must also have them on their declarations as not doing so may affect the addressing mode used by the compiler and may result in a linker error. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2006-12-07[PATCH] remove HASH_HIGHMEMAndrew Morton
It has no users and it's doubtful that we'll need it again. Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26[PATCH] bootmem: miscellaneous coding style fixesFranck Bui-Huu
It fixes various coding style issues, specially when spaces are useless. For example '*' go next to the function name. Signed-off-by: Franck Bui-Huu <vagabon.xyz@gmail.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26[PATCH] bootmem: remove useless headers inclusionsFranck Bui-Huu
Signed-off-by: Franck Bui-Huu <vagabon.xyz@gmail.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26[PATCH] bootmem: limit to 80 columns widthFranck Bui-Huu
Signed-off-by: Franck Bui-Huu <vagabon.xyz@gmail.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26[PATCH] bootmem: remove useless parentheses in bootmem header fileFranck Bui-Huu
Signed-off-by: Franck Bui-Huu <vagabon.xyz@gmail.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26[PATCH] bootmem: remove useless __init in header fileFranck Bui-Huu
__init in headers is pretty useless because the compiler doesn't check it, and they get out of sync relatively frequently. So if you see an __init in a header file, it's quite unreliable and you need to check the definition anyway. Signed-off-by: Franck Bui-Huu <vagabon.xyz@gmail.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-22[XFRM]: Dynamic xfrm_state hash table sizing.David S. Miller
The grow algorithm is simple, we grow if: 1) we see a hash chain collision at insert, and 2) we haven't hit the hash size limit (currently 1*1024*1024 slots), and 3) the number of xfrm_state objects is > the current hash mask All of this needs some tweaking. Remove __initdata from "hashdist" so we can use it safely at run time. Signed-off-by: David S. Miller <davem@davemloft.net>
2006-07-10[PATCH] FRV: Fix FRV arch compile errorsDavid Howells
Fix some FRV arch compile errors, including: (*) Marking nr_kernel_pages as __meminitdata so that references to it end up being properly calculated rather than being assumed to be in the small data section (and thus calculated wrt the GP register). Not doing this causes the linker to emit errors as the offset is too big to fit into the load instruction. (*) Move pm_power_off into an unconditionally compiled .c file as it's now unconditionally accessed. (*) Declare frv_change_cmode() in a header file rather than in a .c file, and declare it asmlinkage. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] wait_table and zonelist initializing for memory hotadd: change to ↵Yasunori Goto
meminit for build_zonelist Change definitions of some functions and data from __init to __meminit. These functions and data can be used after bootup by this patch to be used for hot-add codes. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09[PATCH] x86_64: Handle empty PXMs that only contain hotplug memoryAndi Kleen
The node setup code would try to allocate the node metadata in the node itself, but that fails if there is no memory in there. This can happen with memory hotplug when the hotplug area defines an so far empty node. Now use bootmem to try to allocate the mem_map in other nodes. And if it fails don't panic, but just ignore the node. To make this work I added a new __alloc_bootmem_nopanic function that does what its name implies. TBD should try to use nearby nodes here. Currently we just use any. It's hard to do it better because bootmem doesn't have proper fallback lists yet. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27[PATCH] for_each_online_pgdat: for_each_bootmemKAMEZAWA Hiroyuki
Add a list_head to bootmem_data_t and make bootmems use it. bootmem list is sorted by node_boot_start. Only nodes against which init_bootmem() is called are linked to the list. (i386 allocates bootmem only from one node(0) not from all online nodes.) A summary: 1. for_each_online_pgdat() traverses all *online* nodes. 2. alloc_bootmem() allocates memory only from initialized-for-bootmem nodes. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25[PATCH] x86_64: Try to allocate node memmap near the end of nodeAndi Kleen
This fixes problems with very large nodes (over 128GB) filling up all of the first 4GB with their mem_map and not leaving enough space for the swiotlb. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] Cleanup bootmem allocator and fix alloc_bootmem_lowRavikiran G Thirumalai
Patch cleans up the alloc_bootmem fix for swiotlb. Patch removes alloc_bootmem_*_limit api and fixes alloc_boot_*low api to do the right thing -- allocate from low32 memory. Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-19[PATCH] swiotlb: make sure initial DMA allocations really are in DMA memoryYasunori Goto
This introduces a limit parameter to the core bootmem allocator; The new parameter indicates that physical memory allocated by the bootmem allocator should be within the requested limit. We also introduce alloc_bootmem_low_pages_limit, alloc_bootmem_node_limit, alloc_bootmem_low_pages_node_limit apis, but alloc_bootmem_low_pages_limit is the only api used for swiotlb. The existing alloc_bootmem_low_pages() api could instead have been changed and made to pass right limit to the core allocator. But that would make the patch more intrusive for 2.6.14, as other arches use alloc_bootmem_low_pages(). We may be done that post 2.6.14 as a cleanup. With this, swiotlb gets memory within 4G for both x86_64 and ia64 arches. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Ravikiran G Thirumalai <kiran@scalex86.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] kdump: Retrieve saved max pfnVivek Goyal
This patch retrieves the max_pfn being used by previous kernel and stores it in a safe location (saved_max_pfn) before it is overwritten due to user defined memory map. This pfn is used to make sure that user does not try to read the physical memory beyond saved_max_pfn. Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] sparsemem base: simple NUMA remap space allocatorDave Hansen
Introduce a simple allocator for the NUMA remap space. This space is very scarce, used for structures which are best allocated node local. This mechanism is also used on non-NUMA ia64 systems with a vmem_map to keep the pgdat->node_mem_map initialized in a consistent place for all architectures. Issues: o alloc_remap takes a node_id where we might expect a pgdat which was intended to allow us to allocate the pgdat's using this mechanism; which we do not yet do. Could have alloc_remap_node() and alloc_remap_nid() for this purpose. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04[PATCH] frv: add initdata variable spec in a header fileDavid Howells
The attached patch marks a variable as __initdata in a header file so that the FRV gcc generates the correct access method as initdata variables are too far from the GPREL pointer to access directly. Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-03[PATCH] alloc_large_system_hash: NUMA interleavingBrent Casavant
NUMA systems running current Linux kernels suffer from substantial inequities in the amount of memory allocated from each NUMA node during boot. In particular, several large hashes are allocated using alloc_bootmem, and as such are allocated contiguously from a single node each. This becomes a problem for certain workloads that are relatively common on big-iron HPC NUMA systems. In particular, a number of MPI and OpenMP applications which require nearly all available processors in the system and nearly all the memory on each node run into difficulties. Due to the uneven memory distribution onto a few nodes, any thread on those nodes will require a portion of its memory be allocated from remote nodes. Any access to those memory locations will be slower than local accesses, and thereby slows down the effective computation rate for the affected CPUs/threads. This problem is further amplified if the application is tightly synchronized between threads (as is often the case), as they entire job can run only at the speed of the slowest thread. Additionally since these hashes are usually accessed by all CPUS in the system, the NUMA network link on the node which hosts the hash experiences disproportionate traffic levels, thereby reducing the memory bandwidth available to that node's CPUs, and further penalizing performance of the threads executed thereupon. As such, it is desired to find a way to distribute these large hash allocations more evenly across NUMA nodes. Fortunately current kernels do perform allocation interleaving for vmalloc() during boot, which provides a stepping stone to a solution. This series of patches enables (but does not require) the kernel to allocate several boot time hashes using vmalloc rather than alloc_bootmem, thereby causing the hashes to be interleaved amongst NUMA nodes. In particular the dentry cache, inode cache, TCP ehash, and TCP bhash have been changed to be allocated in this manner. Due to the limited vmalloc space on architectures such as i386, this behavior is turned on by default only for IA64 NUMA systems (though there is no reason other interested architectures could not enable it if desired). Non-IA64 and non-NUMA systems continue to use the existing alloc_bootmem() allocation mechanism. A boot line parameter "hashdist" can be set to override the default behavior. The following two sets of example output show the uneven distribution just after boot, using init=/bin/sh to eliminate as much non-kernel allocation as possible. Without the boot hash distribution patches: Nid MemTotal MemFree MemUsed (in kB) 0 3870656 3697696 172960 1 3882992 3866656 16336 2 3883008 3866784 16224 3 3882992 3866464 16528 4 3883008 3866592 16416 5 3883008 3866720 16288 6 3882992 3342176 540816 7 3883008 3865440 17568 8 3882992 3866560 16432 9 3883008 3866400 16608 10 3882992 3866592 16400 11 3883008 3866400 16608 12 3882992 3866400 16592 13 3883008 3866432 16576 14 3883008 3866528 16480 15 3864768 3848256 16512 ToT 62097440 61152096 945344 Notice that nodes 0 and 6 have a substantially larger memory utilization than all other nodes. With the boot hash distribution patch: Nid MemTotal MemFree MemUsed (in kB) 0 3870656 3789792 80864 1 3882992 3843776 39216 2 3883008 3843808 39200 3 3882992 3843904 39088 4 3883008 3827488 55520 5 3883008 3843712 39296 6 3882992 3843936 39056 7 3883008 3844096 38912 8 3882992 3843712 39280 9 3883008 3844000 39008 10 3882992 3843872 39120 11 3883008 3843872 39136 12 3882992 3843808 39184 13 3883008 3843936 39072 14 3883008 3843712 39296 15 3864768 3825760 39008 ToT 62097440 61413184 684256 While not perfectly even, we can see that there is a substantial improvement in the spread of memory allocated by the kernel during boot. The remaining uneveness may be due in part to further boot time allocations that could be addressed in a similar manner, but some difference is due to the somewhat special nature of node 0 during boot. However the uneveness has fallen to a much more acceptable level (at least to a level that SGI isn't concerned about). The astute reader will also notice that in this example, with this patch approximately 256 MB less memory was allocated during boot. This is due to the size limits of a single vmalloc. More specifically, this is because the automatically computed size of the TCP ehash exceeds the maximum size which a single vmalloc can accomodate. However this is of little practical concern as the vmalloc size limit simply reduces one ridiculously large allocation (512MB) to a slightly less ridiculously large allocation (256MB). In practice machines with large memory configurations are using the thash_entries setting to limit the size of the TCP ehash _much_ lower than either of the automatically computed values. Illustrative of the exceedingly large nature of the automatically computed size, SGI currently recommends that customers boot with thash_entries=2097152, which works out to a 32MB allocation. In any case, setting hashdist=0 will allow for allocations in excess of vmalloc limits, if so desired. Other than the vmalloc limit, great care was taken to ensure that the size of TCP hash allocations was not altered by this patch. Due to slightly different computation techniques between the existing TCP code and alloc_large_system_hash (which is now utilized), some of the magic constants in the TCP hash allocation code were changed. On all sizes of system (128MB through 64GB) that I had access to, the patched code preserves the previous hash size, as long as the vmalloc limit (256MB on IA64) is not encountered. There was concern that changing the TCP-related hashes to use vmalloc space may adversely impact network performance. To this end the netperf set of benchmarks was run. Some individual tests seemed to benefit slightly, some seemed to be harmed slightly, but in all cases the average difference with and without these patches was well within the variabilty I would see from run to run. The following is the overall netperf averages (30 10 second runs each) against an older kernel with these same patches. These tests were run over loopback as GigE results were so inconsistent run to run both with and without these patches that they provided no meaningful comparison that I could discern. I used the same kernel (IA64 generic) for each run, simply varying the new "hashdist" boot parameter to turn on or off the new allocation behavior. In all cases the thash_entries value was manually specified as discussed previously to eliminate any variability that might result from that size difference. HP ZX1, hashdist=0 ================== TCP_RR = 19389 TCP_MAERTS = 6561 TCP_STREAM = 6590 TCP_CC = 9483 TCP_CRR = 8633 HP ZX1, hashdist=1 ================== TCP_RR = 19411 TCP_MAERTS = 6559 TCP_STREAM = 6584 TCP_CC = 9454 TCP_CRR = 8626 SGI Altix, hashdist=0 ===================== TCP_RR = 16871 TCP_MAERTS = 3925 TCP_STREAM = 4055 TCP_CC = 8438 TCP_CRR = 7750 SGI Altix, hashdist=1 ===================== TCP_RR = 17040 TCP_MAERTS = 3913 TCP_STREAM = 4044 TCP_CC = 8367 TCP_CRR = 7538 I believe the TCP_CC and TCP_CRR are the tests most sensitive to this particular change. But again, I want to emphasize that even the differences you see above are _well_ within the variability I saw from run to run of any given test. In addition, Jose Santos at IBM has run specSFS, which has been particularly sensitive to TLB issues, against these patches and saw no performance degredation (differences down in the noise). This patch: Modifies alloc_large_system_hash to enable the use of vmalloc to alleviate boottime allocation imbalances on NUMA systems. Due to limited vmalloc space on some architectures (i.e. x86), the use of vmalloc is enabled by default only on NUMA IA64 kernels. There should be no problem enabling this change for any other interested NUMA architecture. Signed-off-by: Brent Casavant <bcasavan@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-30[PATCH] fix PID hash sizingNick Piggin
A 4GB, 4-way Opteron would create the smallest size table (16 entries) because pidhash_init is called before mem_init which is where x86-64 sets up max_pfn. nr_kernel_pages is setup by paging_init, called from setup_arch, which is also where i386 sets up max_pfn. So export nr_kernel_pages, nr_all_pages. Use nr_kernel_pages when sizing the PID hash. This fixes the problem. This also makes the pid hash dependant on the size of ZONE_NORMAL instead of total size of memory. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-20[PATCH] Permit inode & dentry hash tables to be allocated > MAX_ORDER sizeDavid Howells
Here's a patch to allocate memory for big system hash tables with the bootmem allocator rather than with main page allocator. It is needed for three reasons: (1) So that the size can be bigger than MAX_ORDER. IBM have done some testing on their big PPC64 systems (64GB of RAM) with linux-2.4 and found that they get better performance if the sizes of the inode cache hash, dentry cache hash, buffer head hash and page cache hash are increased beyond MAX_ORDER (order 11). Now the main allocator can't allocate anything larger than MAX_ORDER, but the bootmem allocator can. In 2.6 it appears that only the inode and dentry hashes remain of those four, but there are other hash tables that could use this service. (2) Changing MAX_ORDER appears to have a number of effects beyond just limiting the maximum size that can be allocated in one go. (3) Should someone want a hash table in which each bucket isn't a power of two in size, memory will be wasted as the chunk of memory allocated will be a power of two in size (to hold a power of two number of buckets). On the other hand, using the bootmem allocator means the allocation will only take up sufficient pages to hold it, rather than the next power of two up. Admittedly, this point doesn't apply to the dentry and inode hashes, but it might to another hash table that might want to use this service. I've coelesced the meat of the inode and dentry allocation routines into one such routine in mm/page_alloc.c that the the respective initialisation functions now call before mem_init() is called. This routine gets it's approximation of memory size by counting up the ZONE_NORMAL and ZONE_DMA pages (and ZONE_HIGHMEM if requested) in all the nodes passed to the main allocator by paging_init() (or wherever the arch does it). It does not use max_low_pfn as that doesn't seem to be available on all archs, and it doesn't use num_physpages since that includes highmem pages not available to the kernel for allocating data structures upon - which may not be appropriate when calculating hash table size. On the off chance that the size of each hash bucket may not be exactly a power of two, the routine will only allocate as many pages as is necessary to ensure that the number of buckets is exactly a power of two, rather than allocating the smallest power-of-two sized chunk of memory that will hold the same array of buckets. The maximum size of any single hash table is given by MAX_SYS_HASH_TABLE_ORDER, as is now defined in linux/mmzone.h. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>