path: root/include/linux/bootmem.h
Age  Commit message  Author
2005-01-04  [PATCH] frv: add initdata variable spec in a header file  (David Howells)
The attached patch marks a variable as __initdata in a header file so that the FRV gcc generates the correct access method, as initdata variables are too far from the GPREL pointer to access directly.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
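For illustration, a hedged sketch of the pattern being applied (the variable name here is hypothetical, not the actual symbol touched by the patch):

    #include <linux/init.h>

    /* definition, e.g. in mm/bootmem.c -- placed in .init.data and
     * discarded after boot; "boot_pfn_limit" is a hypothetical name */
    unsigned long __initdata boot_pfn_limit;

    /* declaration in the header -- repeating __initdata tells the FRV
     * gcc that the symbol lives in .init.data, too far from the GPREL
     * base pointer for a GP-relative access, so it emits an absolute
     * access instead */
    extern unsigned long __initdata boot_pfn_limit;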
2005-01-03  [PATCH] alloc_large_system_hash: NUMA interleaving  (Brent Casavant)
NUMA systems running current Linux kernels suffer from substantial inequities in the amount of memory allocated from each NUMA node during boot. In particular, several large hashes are allocated using alloc_bootmem, and as such each is allocated contiguously from a single node.

This becomes a problem for certain workloads that are relatively common on big-iron HPC NUMA systems. In particular, a number of MPI and OpenMP applications which require nearly all available processors in the system and nearly all the memory on each node run into difficulties. Due to the uneven memory distribution onto a few nodes, any thread on those nodes will require a portion of its memory to be allocated from remote nodes. Any access to those memory locations will be slower than local accesses, and thereby slows down the effective computation rate for the affected CPUs/threads. This problem is further amplified if the application is tightly synchronized between threads (as is often the case), as the entire job can run only at the speed of the slowest thread.

Additionally, since these hashes are usually accessed by all CPUs in the system, the NUMA network link on the node which hosts the hash experiences disproportionate traffic levels, thereby reducing the memory bandwidth available to that node's CPUs and further penalizing performance of the threads executed thereupon.

As such, it is desirable to find a way to distribute these large hash allocations more evenly across NUMA nodes. Fortunately, current kernels already perform allocation interleaving for vmalloc() during boot, which provides a stepping stone to a solution.

This series of patches enables (but does not require) the kernel to allocate several boot-time hashes using vmalloc rather than alloc_bootmem, thereby causing the hashes to be interleaved amongst NUMA nodes. In particular, the dentry cache, inode cache, TCP ehash, and TCP bhash have been changed to be allocated in this manner. Due to the limited vmalloc space on architectures such as i386, this behavior is turned on by default only for IA64 NUMA systems (though there is no reason other interested architectures could not enable it if desired). Non-IA64 and non-NUMA systems continue to use the existing alloc_bootmem() allocation mechanism. A boot-line parameter "hashdist" can be set to override the default behavior.

The following two sets of example output show the uneven distribution just after boot, using init=/bin/sh to eliminate as much non-kernel allocation as possible.

Without the boot hash distribution patches:

 Nid   MemTotal   MemFree  MemUsed  (in kB)
   0    3870656   3697696   172960
   1    3882992   3866656    16336
   2    3883008   3866784    16224
   3    3882992   3866464    16528
   4    3883008   3866592    16416
   5    3883008   3866720    16288
   6    3882992   3342176   540816
   7    3883008   3865440    17568
   8    3882992   3866560    16432
   9    3883008   3866400    16608
  10    3882992   3866592    16400
  11    3883008   3866400    16608
  12    3882992   3866400    16592
  13    3883008   3866432    16576
  14    3883008   3866528    16480
  15    3864768   3848256    16512
 ToT   62097440  61152096   945344

Notice that nodes 0 and 6 have a substantially larger memory utilization than all other nodes.
With the boot hash distribution patch:

 Nid   MemTotal   MemFree  MemUsed  (in kB)
   0    3870656   3789792    80864
   1    3882992   3843776    39216
   2    3883008   3843808    39200
   3    3882992   3843904    39088
   4    3883008   3827488    55520
   5    3883008   3843712    39296
   6    3882992   3843936    39056
   7    3883008   3844096    38912
   8    3882992   3843712    39280
   9    3883008   3844000    39008
  10    3882992   3843872    39120
  11    3883008   3843872    39136
  12    3882992   3843808    39184
  13    3883008   3843936    39072
  14    3883008   3843712    39296
  15    3864768   3825760    39008
 ToT   62097440  61413184   684256

While not perfectly even, we can see that there is a substantial improvement in the spread of memory allocated by the kernel during boot. The remaining unevenness may be due in part to further boot-time allocations that could be addressed in a similar manner, but some difference is due to the somewhat special nature of node 0 during boot. In any case, the unevenness has fallen to a much more acceptable level (at least to a level that SGI isn't concerned about).

The astute reader will also notice that in this example approximately 256 MB less memory was allocated during boot with this patch. This is due to the size limit of a single vmalloc. More specifically, the automatically computed size of the TCP ehash exceeds the maximum size which a single vmalloc can accommodate. However, this is of little practical concern, as the vmalloc size limit simply reduces one ridiculously large allocation (512 MB) to a slightly less ridiculously large allocation (256 MB). In practice, machines with large memory configurations use the thash_entries setting to limit the size of the TCP ehash _much_ lower than either of the automatically computed values. Illustrative of the exceedingly large nature of the automatically computed size, SGI currently recommends that customers boot with thash_entries=2097152, which works out to a 32 MB allocation. In any case, setting hashdist=0 will allow for allocations in excess of vmalloc limits, if so desired.

Other than the vmalloc limit, great care was taken to ensure that the size of TCP hash allocations was not altered by this patch. Due to slightly different computation techniques between the existing TCP code and alloc_large_system_hash (which is now utilized), some of the magic constants in the TCP hash allocation code were changed. On all sizes of system (128 MB through 64 GB) that I had access to, the patched code preserves the previous hash size, as long as the vmalloc limit (256 MB on IA64) is not encountered.

There was concern that changing the TCP-related hashes to use vmalloc space might adversely impact network performance. To this end, the netperf set of benchmarks was run. Some individual tests seemed to benefit slightly, some seemed to be harmed slightly, but in all cases the average difference with and without these patches was well within the variability I would see from run to run.

The following are the overall netperf averages (30 ten-second runs each) against an older kernel with these same patches. These tests were run over loopback, as GigE results were so inconsistent from run to run, both with and without these patches, that they provided no meaningful comparison that I could discern. I used the same kernel (IA64 generic) for each run, simply varying the new "hashdist" boot parameter to turn the new allocation behavior on or off. In all cases the thash_entries value was manually specified as discussed previously to eliminate any variability that might result from that size difference.
HP ZX1, hashdist=0
==================
TCP_RR     = 19389
TCP_MAERTS =  6561
TCP_STREAM =  6590
TCP_CC     =  9483
TCP_CRR    =  8633

HP ZX1, hashdist=1
==================
TCP_RR     = 19411
TCP_MAERTS =  6559
TCP_STREAM =  6584
TCP_CC     =  9454
TCP_CRR    =  8626

SGI Altix, hashdist=0
=====================
TCP_RR     = 16871
TCP_MAERTS =  3925
TCP_STREAM =  4055
TCP_CC     =  8438
TCP_CRR    =  7750

SGI Altix, hashdist=1
=====================
TCP_RR     = 17040
TCP_MAERTS =  3913
TCP_STREAM =  4044
TCP_CC     =  8367
TCP_CRR    =  7538

I believe TCP_CC and TCP_CRR are the tests most sensitive to this particular change. But again, I want to emphasize that even the differences you see above are _well_ within the variability I saw from run to run of any given test. In addition, Jose Santos at IBM has run specSFS, which has been particularly sensitive to TLB issues, against these patches and saw no performance degradation (differences down in the noise).

This patch: Modifies alloc_large_system_hash to enable the use of vmalloc to alleviate boot-time allocation imbalances on NUMA systems. Due to limited vmalloc space on some architectures (e.g. x86), the use of vmalloc is enabled by default only on NUMA IA64 kernels. There should be no problem enabling this change for any other interested NUMA architecture.

Signed-off-by: Brent Casavant <bcasavan@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
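For illustration, a minimal sketch of the allocation choice this series introduces (simplified; the real logic lives in alloc_large_system_hash() in mm/page_alloc.c, and the wrapper function here is hypothetical):

    #include <linux/bootmem.h>
    #include <linux/vmalloc.h>

    static void *alloc_boot_hash(unsigned long bytes)
    {
            /* hashdist defaults to 1 only on NUMA IA64 kernels and can
             * be overridden on the boot line with "hashdist=0"/"=1" */
            if (hashdist)
                    /* vmalloc pages are interleaved across NUMA nodes
                     * at boot, spreading the hash across the machine;
                     * a single vmalloc is capped in size (256MB on IA64) */
                    return vmalloc(bytes);

            /* previous behavior: one contiguous chunk from one node */
            return alloc_bootmem(bytes);
    }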
2004-08-30  [PATCH] fix PID hash sizing  (Nick Piggin)
A 4GB, 4-way Opteron would create the smallest-size table (16 entries) because pidhash_init is called before mem_init, which is where x86-64 sets up max_pfn. nr_kernel_pages is set up by paging_init, called from setup_arch, which is also where i386 sets up max_pfn.

So export nr_kernel_pages and nr_all_pages, and use nr_kernel_pages when sizing the PID hash. This fixes the problem. It also makes the PID hash dependent on the size of ZONE_NORMAL instead of the total size of memory.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
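A hedged sketch of the resulting sizing (close to, but not necessarily identical to, the committed code; the clamp constants are illustrative):

    #include <linux/kernel.h>
    #include <linux/bitops.h>
    #include <linux/mmzone.h>   /* nr_kernel_pages, exported by this patch */

    static int pidhash_shift;
    static unsigned int pidhash_size;

    void __init pidhash_init(void)
    {
            /* nr_kernel_pages counts ZONE_NORMAL and below, and is
             * already valid here because paging_init() has run */
            unsigned long megabytes = nr_kernel_pages >> (20 - PAGE_SHIFT);

            /* roughly 4 hash buckets per megabyte of kernel memory,
             * clamped between 1<<4 and 1<<12 entries */
            pidhash_shift = max(4, fls(megabytes * 4));
            pidhash_shift = min(12, pidhash_shift);
            pidhash_size = 1 << pidhash_shift;
    }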
2004-06-20  [PATCH] Permit inode & dentry hash tables to be allocated > MAX_ORDER size  (David Howells)
Here's a patch to allocate memory for big system hash tables with the bootmem allocator rather than with the main page allocator. It is needed for three reasons:

 (1) So that the size can be bigger than MAX_ORDER. IBM have done some testing on their big PPC64 systems (64GB of RAM) with linux-2.4 and found that they get better performance if the sizes of the inode cache hash, dentry cache hash, buffer head hash and page cache hash are increased beyond MAX_ORDER (order 11). The main allocator can't allocate anything larger than MAX_ORDER, but the bootmem allocator can. In 2.6 it appears that only the inode and dentry hashes remain of those four, but there are other hash tables that could use this service.

 (2) Changing MAX_ORDER appears to have a number of effects beyond just limiting the maximum size that can be allocated in one go.

 (3) Should someone want a hash table in which each bucket isn't a power of two in size, memory will be wasted, as the chunk of memory allocated will be a power of two in size (to hold a power-of-two number of buckets). On the other hand, using the bootmem allocator means the allocation will only take up sufficient pages to hold it, rather than the next power of two up. Admittedly, this point doesn't apply to the dentry and inode hashes, but it might to another hash table that might want to use this service.

I've coalesced the meat of the inode and dentry allocation routines into one such routine in mm/page_alloc.c that the respective initialisation functions now call before mem_init() is called. This routine gets its approximation of memory size by counting up the ZONE_NORMAL and ZONE_DMA pages (and ZONE_HIGHMEM if requested) in all the nodes passed to the main allocator by paging_init() (or wherever the arch does it). It does not use max_low_pfn, as that doesn't seem to be available on all archs, and it doesn't use num_physpages, since that includes highmem pages not available to the kernel for allocating data structures upon - which may not be appropriate when calculating hash table size.

On the off chance that the size of each hash bucket may not be exactly a power of two, the routine will only allocate as many pages as are necessary to ensure that the number of buckets is exactly a power of two, rather than allocating the smallest power-of-two-sized chunk of memory that will hold the same array of buckets.

The maximum size of any single hash table is given by MAX_SYS_HASH_TABLE_ORDER, as now defined in linux/mmzone.h.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
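The shape of the sizing, as a hedged sketch (identifiers are illustrative, not the exact mm/page_alloc.c code):

    #include <linux/bootmem.h>
    #include <linux/bitops.h>

    static void *__init alloc_hash_table(unsigned long counted_pages,
                                         unsigned int scale,
                                         unsigned long bucket_size,
                                         unsigned long *nbuckets)
    {
            unsigned long buckets = counted_pages >> scale;

            /* the number of buckets must be an exact power of two,
             * even if the bucket size itself is not */
            buckets = 1UL << fls(buckets - 1);
            *nbuckets = buckets;

            /* bootmem is not bound by MAX_ORDER, and it hands back
             * just enough pages rather than a power-of-two chunk */
            return alloc_bootmem(buckets * bucket_size);
    }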
2003-04-12  [PATCH] bootmem speedup from the IA64 tree  (Andrew Morton)
From: Christoph Hellwig <hch@lst.de>

This patch is from the IA64 tree, with some minor cleanups by me. David described it as:

This is a performance speedup and some minor indentation fixups. The problem is that the bootmem code is (a) hugely slow and (b) has execution time that grows quadratically with the size of the bootmap bitmap. This causes noticeable slowdowns, especially on machines with (relatively) large holes in the physical memory map.

Issue (b) is addressed by maintaining the "last_success" cache, so that we start the next search from the place where we last found some memory (this part of the patch could stand additional reviewing/testing). Issue (a) is addressed by using find_next_zero_bit() instead of the slow bit-by-bit testing.
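A hedged sketch of what the improved scan looks like (simplified; "bdata" is the node's bootmem_data, "eidx"/"areasize" follow the bootmem code's naming, and run_is_free() is a hypothetical helper that checks that 'areasize' consecutive bits are clear):

    unsigned long i = bdata->last_success;   /* cache added by the patch */

    while (i < eidx) {
            /* fix for (a): jump straight to the next zero bit rather
             * than testing the bitmap bit by bit */
            i = find_next_zero_bit(bdata->node_bootmem_map, eidx, i);
            if (i >= eidx)
                    break;
            if (run_is_free(bdata, i, areasize)) {   /* hypothetical */
                    /* fix for (b): remember where we succeeded so the
                     * next search need not rescan from the beginning */
                    bdata->last_success = i;
                    return i;
            }
            i++;
    }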
2002-09-03  [PATCH] discontigmem support for ia32 NUMA  (Andrew Morton)
- All the support macros which assume a linear mem_map[] have been wrapped in !CONFIG_DISCONTIGMEM (see the sketch after this list): pfn_to_page, page_to_pfn, page_to_phys, pmd_page, kern_addr_valid.

- Move some initialisation macros into setup.h so they can be used in the i386 discontig.c (INITRD_START, INITRD_SIZE).

- Alternate version of the bootmem allocator: add i386 discontig support and numaq support.
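For the first item, a hedged sketch of the shape of the wrapping (simplified from the 2.5-era i386 definitions):

    #ifndef CONFIG_DISCONTIGMEM
    /* flat memory model: one global mem_map[] indexed directly by pfn */
    #define pfn_to_page(pfn)        (mem_map + (pfn))
    #define page_to_pfn(page)       ((unsigned long)((page) - mem_map))
    #endif
    /* CONFIG_DISCONTIGMEM builds instead pick up node-aware versions
     * that first find the node owning the pfn, then index that node's
     * own mem_map */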
2002-02-04  v2.5.0.1 -> v2.5.0.2  (Linus Torvalds)
- Greg KH: USB update
- Richard Gooch: refcounting for devfs
- Jens Axboe: start of new block IO layer
2002-02-04  Import changeset  (Linus Torvalds)