|
Merge tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves
disk space by teaching ocfs2 to reclaim suballocator block group
space (Heming Zhao)
- "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the
ARRAY_END() macro and uses it in various places (Alejandro Colomar)
- "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes
the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the
page size (Pnina Feder)
- "kallsyms: Prevent invalid access when showing module buildid" cleans
up kallsyms code related to module buildid and fixes an invalid
access crash when printing backtraces (Petr Mladek)
- "Address page fault in ima_restore_measurement_list()" fixes a
kexec-related crash that can occur when booting the second-stage
kernel on x86 (Harshit Mogalapalli)
- "kho: ABI headers and Documentation updates" updates the kexec
handover ABI documentation (Mike Rapoport)
- "Align atomic storage" adds the __aligned attribute to atomic_t and
atomic64_t definitions to get natural alignment of both types on
csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain)
- "kho: clean up page initialization logic" simplifies the page
initialization logic in kho_restore_page() (Pratyush Yadav)
- "Unload linux/kernel.h" moves several things out of kernel.h and into
more appropriate places (Yury Norov)
- "don't abuse task_struct.group_leader" removes the usage of
->group_leader when it is "obviously unnecessary" (Oleg Nesterov)
- "list private v2 & luo flb" adds some infrastructure improvements to
the live update orchestrator (Pasha Tatashin)
* tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits)
watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency
procfs: fix missing RCU protection when reading real_parent in do_task_stat()
watchdog/softlockup: fix sample ring index wrap in need_counting_irqs()
kcsan, compiler_types: avoid duplicate type issues in BPF Type Format
kho: fix doc for kho_restore_pages()
tests/liveupdate: add in-kernel liveupdate test
liveupdate: luo_flb: introduce File-Lifecycle-Bound global state
liveupdate: luo_file: Use private list
list: add kunit test for private list primitives
list: add primitives for private list manipulations
delayacct: fix uapi timespec64 definition
panic: add panic_force_cpu= parameter to redirect panic to a specific CPU
netclassid: use thread_group_leader(p) in update_classid_task()
RDMA/umem: don't abuse current->group_leader
drm/pan*: don't abuse current->group_leader
drm/amd: kill the outdated "Only the pthreads threading model is supported" checks
drm/amdgpu: don't abuse current->group_leader
android/binder: use same_thread_group(proc->tsk, current) in binder_mmap()
android/binder: don't abuse current->group_leader
kho: skip memoryless NUMA nodes when reserving scratch areas
...
|
|
Merge tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "powerpc/64s: do not re-activate batched TLB flush" makes
arch_{enter|leave}_lazy_mmu_mode() nest properly (Alexander Gordeev)
It adds a generic enter/leave layer and switches architectures to use
it. Various hacks were removed in the process.
- "zram: introduce compressed data writeback" implements data
compression for zram writeback (Richard Chang and Sergey Senozhatsky)
- "mm: folio_zero_user: clear page ranges" adds clearing of contiguous
page ranges for hugepages. Large improvements during demand faulting
are demonstrated (David Hildenbrand)
- "memcg cleanups" tidies up some memcg code (Chen Ridong)
- "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos
stats" improves DAMOS stat's provided information, deterministic
control, and readability (SeongJae Park)
- "selftests/mm: hugetlb cgroup charging: robustness fixes" fixes a few
issues in the hugetlb cgroup charging selftests (Li Wang)
- "Fix va_high_addr_switch.sh test failure - again" addresses several
issues in the va_high_addr_switch test (Chunyu Hu)
- "mm/damon/tests/core-kunit: extend existing test scenarios" improves
the KUnit test coverage for DAMON (Shu Anzai)
- "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" fixes a
glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to
transiently return -EAGAIN (Shivank Garg)
- "arch, mm: consolidate hugetlb early reservation" reworks and
consolidates a pile of straggly code related to reservation of
hugetlb memory from bootmem and creation of CMA areas for hugetlb
(Mike Rapoport)
- "mm: clean up anon_vma implementation" cleans up the anon_vma
implementation in various ways (Lorenzo Stoakes)
- "tweaks for __alloc_pages_slowpath()" does a little streamlining of
the page allocator's slowpath code (Vlastimil Babka)
- "memcg: separate private and public ID namespaces" cleans up the
memcg ID code and prevents the internal-only private IDs from being
exposed to userspace (Shakeel Butt)
- "mm: hugetlb: allocate frozen gigantic folio" cleans up the
allocation of frozen folios and avoids some atomic refcount
operations (Kefeng Wang)
- "mm/damon: advance DAMOS-based LRU sorting" improves DAMOS's movement
of memory between the active and inactive LRUs and adds auto-tuning
of the ratio-based quotas and of monitoring intervals (SeongJae Park)
- "Support page table check on PowerPC" makes
CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc (Andrew Donnellan)
- "nodemask: align nodes_and{,not} with underlying bitmap ops" makes
nodes_and() and nodes_andnot() propagate the return values from the
underlying bit operations, enabling some cleanup in calling code
(Yury Norov)
- "mm/damon: hide kdamond and kdamond_lock from API callers" cleans up
some DAMON internal interfaces (SeongJae Park)
- "mm/khugepaged: cleanups and scan limit fix" does some cleanup work
in khugepaged and fixes a scan limit accounting issue (Shivank Garg)
- "mm: balloon infrastructure cleanups" goes to town on the balloon
infrastructure and its page migration function. Mainly cleanups, also
some locking simplification (David Hildenbrand)
- "mm/vmscan: add tracepoint and reason for kswapd_failures reset" adds
additional tracepoints to the page reclaim code (Jiayuan Chen)
- "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" is
part of Marco's kernel-wide migration from the legacy workqueue APIs
over to the preferred unbound workqueues (Marco Crivellari)
- "Various mm kselftests improvements/fixes" provides various unrelated
improvements/fixes for the mm kselftests (Kevin Brodsky)
- "mm: accelerate gigantic folio allocation" greatly speeds up gigantic
folio allocation, mainly by avoiding unnecessary work in
pfn_range_valid_contig() (Kefeng Wang)
- "selftests/damon: improve leak detection and wss estimation
reliability" improves the reliability of two of the DAMON selftests
(SeongJae Park)
- "mm/damon: cleanup kdamond, damon_call(), damos filter and
DAMON_MIN_REGION" does some cleanup work in the core DAMON code
(SeongJae Park)
- "Docs/mm/damon: update intro, modules, maintainer profile, and misc"
performs maintenance work on the DAMON documentation (SeongJae Park)
- "mm: add and use vma_assert_stabilised() helper" refactors and cleans
up the core VMA code. The main aim here is to be able to use the mmap
write lock's lockdep state to perform various assertions regarding
the locking which the VMA code requires (Lorenzo Stoakes)
- "mm, swap: swap table phase II: unify swapin use" removes some old
swap code (swap cache bypassing and swap synchronization) which
wasn't working very well. Various other cleanups and simplifications
were made. The end result is a 20% speedup in one benchmark (Kairui
Song)
- "enable PT_RECLAIM on more 64-bit architectures" makes PT_RECLAIM
available on 64-bit alpha, loongarch, mips, parisc, and um. Various
cleanups were performed along the way (Qi Zheng)
* tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (325 commits)
mm/memory: handle non-split locks correctly in zap_empty_pte_table()
mm: move pte table reclaim code to memory.c
mm: make PT_RECLAIM depends on MMU_GATHER_RCU_TABLE_FREE
mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config
um: mm: enable MMU_GATHER_RCU_TABLE_FREE
parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE
mips: mm: enable MMU_GATHER_RCU_TABLE_FREE
LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE
alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE
mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h
mm/damon/stat: remove __read_mostly from memory_idle_ms_percentiles
zsmalloc: make common caches global
mm: add SPDX id lines to some mm source files
mm/zswap: use %pe to print error pointers
mm/vmscan: use %pe to print error pointers
mm/readahead: fix typo in comment
mm: khugepaged: fix NR_FILE_PAGES and NR_SHMEM in collapse_file()
mm: refactor vma_map_pages to use vm_insert_pages
mm/damon: unify address range representation with damon_addr_range
mm/cma: replace snprintf with strscpy in cma_new_area
...
|
|
Pull fsverity updates from Eric Biggers:
"fsverity cleanups, speedup, and memory usage optimization from
Christoph Hellwig:
- Move some logic into common code
- Fix btrfs to reject truncates of fsverity files
- Improve the readahead implementation
- Store each inode's fsverity_info in a hash table instead of using a
pointer in the filesystem-specific part of the inode.
This optimizes for memory usage in the usual case where most files
don't have fsverity enabled.
- Look up the fsverity_info fewer times during verification, to
amortize the hash table overhead"
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
fsverity: remove inode from fsverity_verification_ctx
fsverity: use a hashtable to find the fsverity_info
btrfs: consolidate fsverity_info lookup
f2fs: consolidate fsverity_info lookup
ext4: consolidate fsverity_info lookup
fs: consolidate fsverity_info lookup in buffer.c
fsverity: push out fsverity_info lookup
fsverity: deconstify the inode pointer in struct fsverity_info
fsverity: kick off hash readahead at data I/O submission time
ext4: move ->read_folio and ->readahead to readpage.c
readahead: push invalidate_lock out of page_cache_ra_unbounded
fsverity: don't issue readahead for non-ENOENT errors from __filemap_get_folio
fsverity: start consolidating pagecache code
fsverity: pass struct file to ->write_merkle_tree_block
f2fs: don't build the fsverity work handler for !CONFIG_FS_VERITY
ext4: don't build the fsverity work handler for !CONFIG_FS_VERITY
fs,fsverity: clear out fsverity_info from common code
fs,fsverity: reject size changes on fsverity files in setattr_prepare
|
|
Merge tag 'iommu-updates-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Joerg Roedel:
"Core changes:
- Rust bindings for IO-pgtable code
- IOMMU page allocation debugging support
- Disable ATS during PCI resets
Intel VT-d changes:
- Skip dev-iotlb flush for inaccessible PCIe device
- Flush cache for PASID table before using it
- Use right invalidation method for SVA and NESTED domains
- Ensure atomicity in context and PASID entry updates
AMD-Vi changes:
- Support for nested translations
- Other minor improvements
ARM-SMMU-v2 changes:
- Configure SoC-specific prefetcher settings for Qualcomm's "MDSS"
ARM-SMMU-v3 changes:
- Improve CMDQ locking fairness for pathetically small queue sizes
- Remove tracking of the IAS as this is only relevant for AArch32 and
was causing C_BAD_STE errors
- Add device-tree support for NVIDIA's CMDQV extension
- Allow some hitless transitions for the 'MEV' and 'EATS' STE fields
- Don't disable ATS for S1-bypass nested domains
- Additions to the kunit selftests"
* tag 'iommu-updates-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (54 commits)
iommupt: Always add IOVA range to iotlb_gather in gather_range_pages()
iommu/amd: serialize sequence allocation under concurrent TLB invalidations
iommu/amd: Fix type of type parameter to amd_iommufd_hw_info()
iommu/arm-smmu-v3: Do not set disable_ats unless vSTE is Translate
iommu/arm-smmu-v3-test: Add nested s1bypass/s1dssbypass coverage
iommu/arm-smmu-v3: Mark EATS_TRANS safe when computing the update sequence
iommu/arm-smmu-v3: Mark STE MEV safe when computing the update sequence
iommu/arm-smmu-v3: Add update_safe bits to fix STE update sequence
iommu/arm-smmu-v3: Add device-tree support for CMDQV driver
iommu/tegra241-cmdqv: Decouple driver from ACPI
iommu/arm-smmu-qcom: Restore ACTLR settings for MDSS on sa8775p
iommu/vt-d: Fix race condition during PASID entry replacement
iommu/vt-d: Clear Present bit before tearing down context entry
iommu/vt-d: Clear Present bit before tearing down PASID entry
iommu/vt-d: Flush piotlb for SVM and Nested domain
iommu/vt-d: Flush cache for PASID table before using it
iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode
iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode
rust: iommu: fix `srctree` link warning
rust: iommu: fix Rust formatting
...
|
|
Merge tag 'slab-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
- The percpu sheaves caching layer was introduced as opt-in in 6.18 and
now we enable it for all caches and remove the previous cpu (partial)
slab caching mechanism.
Besides the lower locking overhead and much more likely fastpath when
freeing, this removes the rather complicated code related to the cpu
slab lockless fastpaths (using this_cpu_try_cmpxchg128/64) and all
its complications for PREEMPT_RT or kmalloc_nolock().
The lockless slab freelist+counters update operation using
try_cmpxchg128/64 remains and is crucial for freeing remote NUMA
objects, and to allow flushing objects from sheaves to slabs mostly
without the node list_lock (Vlastimil Babka)
- Eliminate slabobj_ext metadata overhead when possible. Instead of
using kmalloc() to allocate the array for memcg and/or allocation
profiling tag pointers, use leftover space in a slab or per-object
padding due to alignment (Harry Yoo)
- Various followup improvements to the above (Hao Li)
* tag 'slab-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (39 commits)
slub: let need_slab_obj_exts() return false if SLAB_NO_OBJ_EXT is set
mm/slab: only allow SLAB_OBJ_EXT_IN_OBJ for unmergeable caches
mm/slab: place slabobj_ext metadata in unused space within s->size
mm/slab: move [__]ksize and slab_ksize() to mm/slub.c
mm/slab: save memory by allocating slabobj_ext array from leftover
mm/memcontrol,alloc_tag: handle slabobj_ext access under KASAN poison
mm/slab: use stride to access slabobj_ext
mm/slab: abstract slabobj_ext access via new slab_obj_ext() helper
ext4: specify the free pointer offset for ext4_inode_cache
mm/slab: allow specifying free pointer offset when using constructor
mm/slab: use unsigned long for orig_size to ensure proper metadata align
slub: clarify object field layout comments
mm/slab: avoid allocating slabobj_ext array from its own slab
slub: avoid list_lock contention from __refill_objects_any()
mm/slub: cleanup and repurpose some stat items
mm/slub: remove DEACTIVATE_TO_* stat items
slab: remove frozen slab checks from __slab_free()
slab: update overview comments
slab: refill sheaves from all nodes
slab: remove unused PREEMPT_RT specific macros
...
|
|
Merge tag 'cgroup-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- cpuset changes:
- Continue separating v1 and v2 implementations by moving more
v1-specific logic into cpuset-v1.c
- Improve partition handling. Sibling partitions are no longer
invalidated on cpuset.cpus conflict, cpuset.cpus changes no longer
fail in v2, and effective_xcpus computation is made consistent
- Fix partition effective CPUs overlap that caused a warning on
cpuset removal when sibling partitions shared CPUs
- Increase the maximum cgroup subsystem count from 16 to 32 to
accommodate future subsystem additions
- Misc cleanups and selftest improvements including switching to
css_is_online() helper, removing dead code and stale documentation
references, using lockdep_assert_cpuset_lock_held() consistently, and
adding polling helpers for asynchronously updated cgroup statistics
* tag 'cgroup-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
cpuset: fix overlap of partition effective CPUs
cgroup: increase maximum subsystem count from 16 to 32
cgroup: Remove stale cpu.rt.max reference from documentation
cpuset: replace direct lockdep_assert_held() with lockdep_assert_cpuset_lock_held()
cgroup/cpuset: Move the v1 empty cpus/mems check to cpuset1_validate_change()
cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict
cgroup/cpuset: Don't fail cpuset.cpus change in v2
cgroup/cpuset: Consistently compute effective_xcpus in update_cpumasks_hier()
cgroup/cpuset: Streamline rm_siblings_excl_cpus()
cpuset: remove dead code in cpuset-v1.c
cpuset: remove v1-specific code from generate_sched_domains
cpuset: separate generate_sched_domains for v1 and v2
cpuset: move update_domain_attr_tree to cpuset_v1.c
cpuset: add cpuset1_init helper for v1 initialization
cpuset: add cpuset1_online_css helper for v1-specific operations
cpuset: add lockdep_assert_cpuset_lock_held helper
cpuset: Remove unnecessary checks in rebuild_sched_domains_locked
cgroup: switch to css_is_online() helper
selftests: cgroup: Replace sleep with cg_read_key_long_poll() for waiting on nr_dying_descendants
selftests: cgroup: make test_memcg_sock robust against delayed sock stats
...
|
|
Merge tag 'locking-core-2026-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"Lock debugging:
- Implement compiler-driven static analysis locking context checking,
using the upcoming Clang 22 compiler's context analysis features
(Marco Elver)
We removed Sparse context analysis support, because prior to
removal even a defconfig kernel produced 1,700+ context tracking
Sparse warnings, the overwhelming majority of which are false
positives. On an allmodconfig kernel the number of false positive
context tracking Sparse warnings grows to over 5,200... On the plus
side of the balance, actual locking bugs found by Sparse context
analysis are also rather ... sparse: I found only 3 such commits in
the last 3 years. So the rate of false positives and the
maintenance overhead is rather high, and there appears to be no
active policy in place to achieve a zero-warnings baseline or to move
the annotations & fixes to developers who introduce new code.
Clang context analysis is more complete and more aggressive in
trying to find bugs, at least in principle. Plus it has a different
model for enabling it: it's enabled subsystem by subsystem, which
results in zero warnings on all relevant kernel builds (as far as
our testing managed to cover it). Which allowed us to enable it by
default, similar to other compiler warnings, with the expectation
that there are no warnings going forward. This enforces a
zero-warnings baseline on clang-22+ builds (which are still limited
in distribution, admittedly).
Hopefully the Clang approach can lead to a more maintainable
zero-warnings status quo and policy, with more and more subsystems
and drivers enabling the feature. Context tracking can be enabled
for all kernel code via WARN_CONTEXT_ANALYSIS_ALL=y (default
disabled), but this will generate a lot of false positives.
( Having said that, Sparse support could still be added back,
if anyone is interested - the removal patch is still
relatively straightforward to revert at this stage. )
Rust integration updates: (Alice Ryhl, Fujita Tomonori, Boqun Feng)
- Add support for Atomic<i8/i16/bool> and replace most Rust native
AtomicBool usages with Atomic<bool>
- Clean up LockClassKey and improve its documentation
- Add missing Send and Sync trait implementation for SetOnce
- Make ARef Unpin as it is supposed to be
- Add __rust_helper to a few Rust helpers as a preparation for
helper LTO
- Inline various lock related functions to avoid additional function
calls
WW mutexes:
- Extend ww_mutex tests and other test-ww_mutex updates (John
Stultz)
Misc fixes and cleanups:
- rcu: Mark lockdep_assert_rcu_helper() __always_inline (Arnd
Bergmann)
- locking/local_lock: Include more missing headers (Peter Zijlstra)
- seqlock: fix scoped_seqlock_read kernel-doc (Randy Dunlap)
- rust: sync: Replace `kernel::c_str!` with C-Strings (Tamir
Duberstein)"
* tag 'locking-core-2026-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (90 commits)
locking/rwlock: Fix write_trylock_irqsave() with CONFIG_INLINE_WRITE_TRYLOCK
rcu: Mark lockdep_assert_rcu_helper() __always_inline
compiler-context-analysis: Remove __assume_ctx_lock from initializers
tomoyo: Use scoped init guard
crypto: Use scoped init guard
kcov: Use scoped init guard
compiler-context-analysis: Introduce scoped init guards
cleanup: Make __DEFINE_LOCK_GUARD handle commas in initializers
seqlock: fix scoped_seqlock_read kernel-doc
tools: Update context analysis macros in compiler_types.h
rust: sync: Replace `kernel::c_str!` with C-Strings
rust: sync: Inline various lock related methods
rust: helpers: Move #define __rust_helper out of atomic.c
rust: wait: Add __rust_helper to helpers
rust: time: Add __rust_helper to helpers
rust: task: Add __rust_helper to helpers
rust: sync: Add __rust_helper to helpers
rust: refcount: Add __rust_helper to helpers
rust: rcu: Add __rust_helper to helpers
rust: processor: Add __rust_helper to helpers
...
|
|
Merge tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
- Support associating BPF program with struct_ops (Amery Hung)
- Switch BPF local storage to rqspinlock and remove recursion detection
counters which were causing false positives (Amery Hung)
- Fix live registers marking for indirect jumps (Anton Protopopov)
- Introduce execution context detection BPF helpers (Changwoo Min)
- Improve verifier precision for 32bit sign extension pattern
(Cupertino Miranda)
- Optimize BTF type lookup by sorting vmlinux BTF and doing binary
search (Donglin Peng)
- Allow states pruning for misc/invalid slots in iterator loops (Eduard
Zingerman)
- In preparation for ASAN support in BPF arenas teach libbpf to move
global BPF variables to the end of the region and enable arena kfuncs
while holding locks (Emil Tsalapatis)
- Introduce support for implicit arguments in kfuncs and migrate a
number of them to new API. This is a prerequisite for cgroup
sub-schedulers in sched-ext (Ihor Solodrai)
- Fix incorrect copied_seq calculation in sockmap (Jiayuan Chen)
- Fix ORC stack unwind from kprobe_multi (Jiri Olsa)
- Speed up fentry attach by using single ftrace direct ops in BPF
trampolines (Jiri Olsa)
- Require frozen map for calculating map hash (KP Singh)
- Fix lock entry creation in TAS fallback in rqspinlock (Kumar
Kartikeya Dwivedi)
- Allow user space to select cpu in lookup/update operations on per-cpu
array and hash maps (Leon Hwang)
- Make kfuncs return trusted pointers by default (Matt Bobrowski)
- Introduce "fsession" support where a single BPF program is executed
upon entry and exit from a traced kernel function (Menglong Dong)
- Allow bpf_timer and bpf_wq use in all program types (Mykyta
Yatsenko, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Alexei
Starovoitov)
- Make KF_TRUSTED_ARGS the default for all kfuncs and clean up their
definition across the tree (Puranjay Mohan)
- Allow BPF arena calls from non-sleepable context (Puranjay Mohan)
- Improve register id comparison logic in the verifier and extend
linked registers with negative offsets (Puranjay Mohan)
- In preparation for BPF-OOM introduce kfuncs to access memcg events
(Roman Gushchin)
- Use CFI compatible destructor kfunc type (Sami Tolvanen)
- Add bitwise tracking for BPF_END in the verifier (Tianci Cao)
- Add range tracking for BPF_DIV and BPF_MOD in the verifier (Yazhou
Tang)
- Make BPF selftests work with 64k page size (Yonghong Song)
* tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (268 commits)
selftests/bpf: Fix outdated test on storage->smap
selftests/bpf: Choose another percpu variable in bpf for btf_dump test
selftests/bpf: Remove test_task_storage_map_stress_lookup
selftests/bpf: Update task_local_storage/task_storage_nodeadlock test
selftests/bpf: Update task_local_storage/recursion test
selftests/bpf: Update sk_storage_omem_uncharge test
bpf: Switch to bpf_selem_unlink_nofail in bpf_local_storage_{map_free, destroy}
bpf: Support lockless unlink when freeing map or local storage
bpf: Prepare for bpf_selem_unlink_nofail()
bpf: Remove unused percpu counter from bpf_local_storage_map_free
bpf: Remove cgroup local storage percpu counter
bpf: Remove task local storage percpu counter
bpf: Change local_storage->lock and b->lock to rqspinlock
bpf: Convert bpf_selem_unlink to failable
bpf: Convert bpf_selem_link_map to failable
bpf: Convert bpf_selem_unlink_map to failable
bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage
selftests/xsk: fix number of Tx frags in invalid packet
selftests/xsk: properly handle batch ending in the middle of a packet
bpf: Prevent reentrance into call_rcu_tasks_trace()
...
|
|
Merge series "slab: replace cpu (partial) slabs with sheaves".
The percpu sheaves caching layer was introduced as opt-in but the goal
was to eventually move all caches to them. This is the next step,
enabling sheaves for all caches (except the two bootstrap ones) and then
removing the per cpu (partial) slabs and lots of associated code.
Besides the lower locking overhead and much more likely fastpath when
freeing, this removes the rather complicated code related to the cpu
slab lockless fastpaths (using this_cpu_try_cmpxchg128/64) and all its
complications for PREEMPT_RT or kmalloc_nolock().
The lockless slab freelist+counters update operation using
try_cmpxchg128/64 remains and is crucial for freeing remote NUMA objects
and to allow flushing objects from sheaves to slabs mostly without the
node list_lock.
Link: https://lore.kernel.org/all/20260123-sheaves-for-all-v4-0-041323d506f7@suse.cz/
|
|
Merge tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks
Pull kthread updates from Frederic Weisbecker:
"The kthread code provides an infrastructure which manages the
preferred affinity of unbound kthreads (node or custom cpumask)
against housekeeping (CPU isolation) constraints and CPU hotplug
events.
One crucial missing piece is the handling of cpuset: when an isolated
partition is created, deleted, or its CPUs updated, all the unbound
kthreads in the top cpuset become indifferently affine to _all_ the
non-isolated CPUs, possibly breaking their preferred affinity along
the way.
Solve this by performing the kthreads affinity update from cpuset in
the consolidated kthread affinity code instead, so that preferred
affinities are honoured and applied against the updated cpuset
isolated partitions.
The dispatch of the new isolated cpumasks to timers, workqueues and
kthreads is performed by housekeeping, as per Tejun's nice
suggestion.
As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
from boot defined domain isolation (through isolcpus=) and cpuset
isolated partitions. Housekeeping cpumasks are now modifiable with a
specific RCU based synchronization. A big step toward making
nohz_full= also mutable through cpuset in the future"
* tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (33 commits)
doc: Add housekeeping documentation
kthread: Document kthread_affine_preferred()
kthread: Comment on the purpose and placement of kthread_affine_node() call
kthread: Honour kthreads preferred affinity after cpuset changes
sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN
sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN
kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management
kthread: Include kthreadd to the managed affinity list
kthread: Include unbound kthreads in the managed affinity list
kthread: Refine naming of affinity related fields
PCI: Remove superfluous HK_TYPE_WQ check
sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated()
cpuset: Remove cpuset_cpu_is_isolated()
timers/migration: Remove superfluous cpuset isolation test
cpuset: Propagate cpuset isolation update to timers through housekeeping
cpuset: Propagate cpuset isolation update to workqueue through housekeeping
PCI: Flush PCI probe workqueue on cpuset isolated partition change
sched/isolation: Flush vmstat workqueues on cpuset isolated partition change
sched/isolation: Flush memcg workqueues on cpuset isolated partition change
cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
...
|
|
Merge tag 'pull-filename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs 'struct filename' updates from Al Viro:
"[Mostly] sanitize struct filename handling"
* tag 'pull-filename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (68 commits)
sysfs(2): fs_index() argument is _not_ a pathname
alpha: switch osf_mount() to strndup_user()
ksmbd: use CLASS(filename_kernel)
mqueue: switch to CLASS(filename)
user_statfs(): switch to CLASS(filename)
statx: switch to CLASS(filename_maybe_null)
quotactl_block(): switch to CLASS(filename)
chroot(2): switch to CLASS(filename)
move_mount(2): switch to CLASS(filename_maybe_null)
namei.c: switch user pathname imports to CLASS(filename{,_flags})
namei.c: convert getname_kernel() callers to CLASS(filename_kernel)
do_f{chmod,chown,access}at(): use CLASS(filename_uflags)
do_readlinkat(): switch to CLASS(filename_flags)
do_sys_truncate(): switch to CLASS(filename)
do_utimes_path(): switch to CLASS(filename_uflags)
chdir(2): unspaghettify a bit...
do_fchownat(): unspaghettify a bit...
fspick(2): use CLASS(filename_flags)
name_to_handle_at(): use CLASS(filename_uflags)
vfs_open_tree(): use CLASS(filename_uflags)
...
|
|
Merge tag 'vfs-7.0-rc1.leases' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs lease updates from Christian Brauner:
"This contains updates for lease support to require filesystems to
explicitly opt-in to lease support
Currently kernel_setlease() falls through to generic_setlease() when
a filesystem does not define ->setlease(), silently granting lease
support to every filesystem regardless of whether it is prepared for
it.
This is a poor default: most filesystems never intended to support
leases, and the silent fallthrough makes it impossible to distinguish
"supports leases" from "never thought about it".
This inverts the default. It adds explicit
.setlease = generic_setlease;
assignments to every in-tree filesystem that should retain lease
support, then changes kernel_setlease() to return -EINVAL when
->setlease is NULL.
With the new default in place, simple_nosetlease() is redundant and
is removed along with all references to it"
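As a rough illustration of the new opt-in model (the example filesystem and the surrounding operations below are invented for illustration; only .setlease/generic_setlease come from this series), a filesystem that wants to keep lease support now wires the helper up explicitly:

  #include <linux/fs.h>
  #include <linux/filelock.h>
  #include <linux/module.h>

  /* Illustrative sketch only: an example filesystem opting in to leases. */
  static const struct file_operations example_file_operations = {
          .owner          = THIS_MODULE,
          .llseek         = generic_file_llseek,
          .read_iter      = generic_file_read_iter,
          /* Explicit opt-in; a NULL ->setlease now means -EINVAL. */
          .setlease       = generic_setlease,
  };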
* tag 'vfs-7.0-rc1.leases' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (25 commits)
fuse: add setlease file operation
fs: remove simple_nosetlease()
filelock: default to returning -EINVAL when ->setlease operation is NULL
xfs: add setlease file operation
ufs: add setlease file operation
udf: add setlease file operation
tmpfs: add setlease file operation
squashfs: add setlease file operation
overlayfs: add setlease file operation
orangefs: add setlease file operation
ocfs2: add setlease file operation
ntfs3: add setlease file operation
nilfs2: add setlease file operation
jfs: add setlease file operation
jffs2: add setlease file operation
gfs2: add a setlease file operation
fat: add setlease file operation
f2fs: add setlease file operation
exfat: add setlease file operation
ext4: add setlease file operation
...
|
|
While we handle pte_lockptr() == pmd_lockptr() correctly in
zap_pte_table_if_empty(), we don't handle it in zap_empty_pte_table(),
making the spin_trylock() always fail and forcing us onto the slow path.
So let's handle the scenario where pte_lockptr() == pmd_lockptr() better,
which can only happen if CONFIG_SPLIT_PTE_PTLOCKS is not set.
This is only relevant once we unlock CONFIG_PT_RECLAIM on architectures
that are not x86-64.
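A minimal sketch of the shape of the fix (function and label names are illustrative, not the actual mm/memory.c code): only trylock the PTE lock when it is distinct from the PMD lock that is already held.

  /* Illustrative sketch, assuming mm and pmd come from the zap path. */
  static bool zap_pte_table_if_empty_sketch(struct mm_struct *mm, pmd_t *pmd)
  {
          spinlock_t *pml = pmd_lock(mm, pmd);    /* PMD page table lock */
          spinlock_t *ptl = pte_lockptr(mm, pmd);
          bool zapped = false;

          /*
           * Without CONFIG_SPLIT_PTE_PTLOCKS, ptl == pml and is already
           * held; trylocking it again would always fail.
           */
          if (ptl != pml && !spin_trylock(ptl))
                  goto out;

          /* ... verify the PTE table is empty, clear the PMD, free it ... */
          zapped = true;

          if (ptl != pml)
                  spin_unlock(ptl);
  out:
          spin_unlock(pml);
          return zapped;
  }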
Link: https://lkml.kernel.org/r/20260119220708.3438514-3-david@kernel.org
Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Some cleanups for PT table reclaim code, triggered by a false-positive
warning we might start to see soon after we unlocked pt-reclaim on
architectures besides x86-64.
This patch (of 2):
The pte-table reclaim code is only called from memory.c, while zapping
pages, and it better also stays that way in the long run. If we ever have
to call it from other files, we should expose proper high-level helpers
for zapping if the existing helpers are not good enough.
So, let's move the code over (it's not a lot) and clean it up a bit by:
- Renaming the functions.
- Dropping the "Check if it is empty PTE page" comment, which is now
self-explaining given the function name.
- Making zap_pte_table_if_empty() return whether zapping worked so the
caller can free it.
- Adding a comment in pte_table_reclaim_possible().
- Inlining free_pte() in the last remaining user.
- In zap_empty_pte_table(), switch from pmdp_get_lockless() to
pmd_clear(), as we are holding the PMD PT lock.
By moving the code over, compilers can also easily figure out when
zap_empty_pte_table() does not initialize the pmdval variable, avoiding
false-positive warnings about the variable possibly not being initialized.
Link: https://lkml.kernel.org/r/20260119220708.3438514-1-david@kernel.org
Link: https://lkml.kernel.org/r/20260119220708.3438514-2-david@kernel.org
Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
PT_RECLAIM can work on all architectures that support
MMU_GATHER_RCU_TABLE_FREE, except for those that have selected
HAVE_ARCH_TLB_REMOVE_TABLE, so make PT_RECLAIM depend on
MMU_GATHER_RCU_TABLE_FREE && !HAVE_ARCH_TLB_REMOVE_TABLE.
BTW, change PT_RECLAIM to be enabled by default, since nobody should want
to turn it off.
Link: https://lkml.kernel.org/r/83b034810935a9ff18e425b085e065bb0acb28f3.1769515122.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Magnus Lindholm <linmag7@gmail.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config
For architectures that define __HAVE_ARCH_TLB_REMOVE_TABLE, the page
tables at the pmd/pud level are generally not of struct ptdesc type, and
do not have pt_rcu_head member, thus these architectures cannot support
PT_RECLAIM.
In preparation for enabling PT_RECLAIM on more architectures, convert
__HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config,
so that we can make conditional judgments in Kconfig.
Link: https://lkml.kernel.org/r/5ebfa3d4b56e63c6906bda5eccaa9f7194d3a86b.1769515122.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Tested-by: Andreas Larsson <andreas@gaisler.com> [sparc, UP&SMP]
Acked-by: Andreas Larsson <andreas@gaisler.com> [sparc]
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Magnus Lindholm <linmag7@gmail.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "enable PT_RECLAIM on more 64-bit architectures", v4.
This series aims to enable PT_RECLAIM on more 64-bit architectures.
On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of
empty PTE page table pages (such as 100GB+). To resolve this problem, we
need to enable PT_RECLAIM, which depends on MMU_GATHER_RCU_TABLE_FREE.
For architectures that define their own __tlb_remove_table(), since
their page tables are not of type struct ptdesc, they cannot support
PT_RECLAIM.
Therefore, this series first enables MMU_GATHER_RCU_TABLE_FREE on all
64-bit architectures, then converts __HAVE_ARCH_TLB_REMOVE_TABLE to
CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config, and finally makes PT_RECLAIM
depend on MMU_GATHER_RCU_TABLE_FREE && !HAVE_ARCH_TLB_REMOVE_TABLE. This
way, PT_RECLAIM can be enabled by default on most 64-bit architectures.
Of course, this will also be enabled on some 32-bit architectures that
already support MMU_GATHER_RCU_TABLE_FREE. That's fine, PT_RECLAIM works
well on all 32-bit architectures as well. Although the benefit isn't
significant, there's still memory that can be reclaimed. Perhaps
PT_RECLAIM can be enabled on all 32-bit architectures in the future.
This patch (of 8):
Generally, asm/tlb.h will include asm-generic/tlb.h, so change
mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h. This is a
preparation for enabling CONFIG_PT_RECLAIM on other architectures, such as
alpha.
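The change itself is just the include line in mm/pt_reclaim.c, roughly:

  #include <asm/tlb.h>    /* was: #include <asm-generic/tlb.h> */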
Link: https://lkml.kernel.org/r/cover.1769515122.git.zhengqi.arch@bytedance.com
Link: https://lkml.kernel.org/r/befca537d10c6bf8d531b1ee0a8af1e3b31352b0.1769515122.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Magnus Lindholm <linmag7@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: WANG Xuerui <kernel@xen0n.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The 'memory_idle_ms_percentiles' array in DAMON_STAT is updated frequently
by the kernel to reflect the latest idle time statistics. Marking it as
'__read_mostly' is inappropriate for data that is regularly written to, as
it can lead to cache pollution in the read-mostly section.
Remove the '__read_mostly' annotation to accurately reflect the
variable's usage pattern.
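Schematically (the element type and array length below are placeholders, not the exact mm/damon/stat.c definition), the change drops the section annotation from a frequently written array:

  /* Before: hot-write data forced into the read-mostly section. */
  static unsigned long memory_idle_ms_percentiles[101] __read_mostly;

  /*
   * After: regular placement, so it no longer shares cachelines with
   * genuinely read-mostly variables.
   */
  static unsigned long memory_idle_ms_percentiles[101];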
Link: https://lkml.kernel.org/r/20260130085603.1814-1-lirongqing@baidu.com
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently, zsmalloc creates kmem_cache of handles and zspages for each
pool, which may be suboptimal from the memory usage point of view (extra
internal fragmentation per pool). Systems that create multiple zsmalloc
pools may benefit from shared common zsmalloc caches.
Make handles and zspages kmem caches global. The memory savings depend on
particular setup and data patterns and can be found via slabinfo.
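Conceptually (identifiers and sizes below are illustrative, not the exact zsmalloc code), the per-pool kmem_cache_create() calls become a single pair of caches created once and shared by every pool:

  #include <linux/init.h>
  #include <linux/slab.h>

  /* Illustrative sketch: one global cache pair instead of per-pool caches. */
  static struct kmem_cache *zs_handle_cache;
  static struct kmem_cache *zspage_cache;

  static int __init zs_caches_init(void)
  {
          zs_handle_cache = kmem_cache_create("zs_handle", ZS_HANDLE_SIZE,
                                              0, 0, NULL);
          zspage_cache = kmem_cache_create("zspage", sizeof(struct zspage),
                                           0, 0, NULL);
          if (!zs_handle_cache || !zspage_cache) {
                  kmem_cache_destroy(zs_handle_cache);
                  kmem_cache_destroy(zspage_cache);
                  return -ENOMEM;
          }
          return 0;
  }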
Link: https://lkml.kernel.org/r/20260117025406.799428-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Some of the memory management source files are missing
SPDX-License-Identifier lines. Add appropriate IDs
to these files (mostly GPL-2.0, but one LGPL-2.1).
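For reference, such an identifier is a single comment on the first line of each file, e.g.:

  // SPDX-License-Identifier: GPL-2.0

(the LGPL-2.1 file uses the corresponding LGPL identifier instead).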
Link: https://lkml.kernel.org/r/20260204213101.1754183-1-tim.bird@sony.com
Signed-off-by: Tim Bird <tim.bird@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use the %pe printk format specifier to report error pointers directly
instead of printing PTR_ERR() as a long value. This improves clarity and
produces more readable error messages.
This instance was flagged by the Coccinelle script
(misc/ptr_err_to_pe.cocci) as an opportunity to adopt %pe.
Found by: make coccicheck MODE=report M=mm/
No functional change intended.
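The transformation has this shape (illustrative call site and variable, not a hunk from this patch):

  if (IS_ERR(acomp)) {
          /* Before: prints a bare negative number, e.g. -2. */
          pr_err("could not alloc crypto acomp: %ld\n", PTR_ERR(acomp));
          /* After: %pe prints the symbolic name, e.g. -ENOENT. */
          pr_err("could not alloc crypto acomp: %pe\n", acomp);
  }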
Link: https://lkml.kernel.org/r/581a26f22fb4c6ce04aeb7ee0d703fe64454ac7f.1770230135.git.chandna.sahil@gmail.com
Signed-off-by: Sahil Chandna <chandna.sahil@gmail.com>
Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use the %pe printk format specifier to report error pointers directly
instead of printing PTR_ERR() as a long value. This improves clarity and
produces more readable error messages.
This instance was flagged by the Coccinelle script
(misc/ptr_err_to_pe.cocci) as an opportunity to adopt %pe.
Found by: make coccicheck MODE=report M=mm/
No functional change intended.
Link: https://lkml.kernel.org/r/80a6643657a60e75ddf48b4869b3e7fdc101f855.1770230135.git.chandna.sahil@gmail.com
Signed-off-by: Sahil Chandna <chandna.sahil@gmail.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Fix a typo in a comment: max_readhead -> max_readahead.
Link: https://lkml.kernel.org/r/20260127152535.321951-1-cheng20011202@gmail.com
Signed-off-by: Wilson Zeng <cheng20011202@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In Meta's fleet, we observed high-level cgroups showing zero file memcg
stats while their descendants had non-zero values. Investigation using
drgn revealed that these parent cgroups actually had negative file stats,
aggregated from their children.
This issue became more frequent after deploying thp-always more widely,
pointing to a correlation with THP file collapsing. The root cause is
that collapse_file() assumes old folios and the new THP belong to the same
node and memcg. When this assumption breaks, stats become skewed. The
bug affects not just memcg stats but also per-numa stats, and not just
NR_FILE_PAGES but also NR_SHMEM.
The assumption breaks in scenarios such as:
1. Small folios allocated on one node while the THP gets allocated on a
different node.
2. A package downloader running in one cgroup populates the page cache,
while a job in a different cgroup executes the downloaded binary.
3. A file shared between processes in different cgroups, where one
process faults in the pages and khugepaged (or madvise(COLLAPSE))
collapses them on behalf of the other.
Fix the accounting by explicitly incrementing stats for the new THP and
decrementing stats for the old folios being replaced.
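A simplified sketch of the accounting idea (not the exact mm/khugepaged.c hunk; new_folio, folio and pagelist are assumed from the surrounding function, and the real change also covers per-node counters): charge the new THP against its own node/memcg and uncharge each replaced folio against its own, rather than assuming they match.

  /* Credit the new THP where it actually lives. */
  __lruvec_stat_mod_folio(new_folio, NR_FILE_PAGES, HPAGE_PMD_NR);
  if (folio_test_swapbacked(new_folio))
          __lruvec_stat_mod_folio(new_folio, NR_SHMEM, HPAGE_PMD_NR);

  /* Debit each old folio against its own node/memcg. */
  list_for_each_entry(folio, &pagelist, lru) {
          __lruvec_stat_mod_folio(folio, NR_FILE_PAGES,
                                  -folio_nr_pages(folio));
          if (folio_test_swapbacked(folio))
                  __lruvec_stat_mod_folio(folio, NR_SHMEM,
                                          -folio_nr_pages(folio));
  }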
Link: https://lkml.kernel.org/r/20260130042925.2797946-1-shakeel.butt@linux.dev
Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Kiryl Shutsemau <kas@kernel.org>
Acked-by: David Hildenbrand (arm) <david@kernel.org>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
vma_map_pages currently calls vm_insert_page on each individual page in
the mapping, which creates significant overhead because we are repeatedly
spinlocking. Instead, we should batch insert pages using vm_insert_pages,
which amortizes the cost of the spinlock.
Tested through watching hardware accelerated video on a MTK ChromeOS
device. This particular path maps both a V4L2 buffer and a GEM allocated
buffer into userspace and converts the contents from one pixel format to
another. Both vb2_mmap() and mtk_gem_object_mmap() exercise this pathway.
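The shape of the change, as a sketch (the helper name is invented for illustration and error handling is trimmed):

  /*
   * Before: one vm_insert_page() call, and one PTE lock round trip, per
   * page.  After: a single batched call that takes the lock once per batch.
   */
  static int map_pages_batched(struct vm_area_struct *vma, unsigned long uaddr,
                               struct page **pages, unsigned long count)
  {
          unsigned long num = count;

          return vm_insert_pages(vma, uaddr, pages, &num);
  }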
Link: https://lkml.kernel.org/r/20260128225648.2938636-1-greenjustin@chromium.org
Signed-off-by: Justin Green <greenjustin@chromium.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Arjun Roy <arjunroy@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently, DAMON defines two identical structures for representing address
ranges: damon_system_ram_region and damon_addr_range. Both structures
share the same semantic interpretation of a half-open interval [start,
end), where the start address is inclusive and the end address is
exclusive.
This duplication adds unnecessary redundancy and increases maintenance
overhead. This patch replaces all uses of damon_system_ram_region with
the more generic damon_addr_range structure, ensuring a unified type
representation for address ranges within the DAMON subsystem. The change
simplifies the codebase, improves readability, and avoids potential
inconsistencies in future modifications.
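The shared structure is simply a half-open [start, end) interval, as defined in include/linux/damon.h:

  struct damon_addr_range {
          unsigned long start;    /* inclusive */
          unsigned long end;      /* exclusive */
  };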
Link: https://lkml.kernel.org/r/20260129100845.281734-1-lienze@kylinos.cn
Signed-off-by: Enze Li <lienze@kylinos.cn>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Replace snprintf("%s", ...) with the faster and more direct strscpy().
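The pattern being replaced looks roughly like this (buffer and source names are illustrative):

  char name[CMA_MAX_NAME];

  /* Before: a format-string round trip just to copy a string. */
  snprintf(name, sizeof(name), "%s", src_name);

  /* After: direct, always-NUL-terminated copy. */
  strscpy(name, src_name, sizeof(name));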
Link: https://lkml.kernel.org/r/20260126174516.236968-1-thorsten.blum@linux.dev
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use the new zs_obj_read_sg_*() APIs in zswap_decompress(), instead of
zs_obj_read_*() APIs returning a linear address. The SG list is passed
directly to the crypto API, simplifying the logic and dropping the
workaround that copies highmem addresses to a buffer. The crypto API
should internally linearize the SG list if needed.
This avoids the memcpy() in zsmalloc for objects spanning multiple pages,
although an equivalent operation will be done internally by acomp/scomp.
However, in the future compression algorithms could support handling
discontiguous SG lists, completely eliminating the copying for spanning
objects.
Zsmalloc fills an SG list up to 2 entries in size, so change the input SG
list to fit 2 entries.
Update the incompressible entries path to use memcpy_from_sglist() to copy
the data to the folio. Opportunistically set dlen to PAGE_SIZE in the
same code path (rather than at the top of the function) to make it
clearer.
Drop the goto in zswap_compress() as the code is now simple enough for
an if-else statement instead. Rename 'decomp_ret' to 'ret' and reuse it
to hold the intermediate return value of crypto_acomp_decompress(), to keep
line lengths manageable.
No functional change intended.
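Roughly, the decompression path now hands scatterlists straight to the acomp API; this sketch assumes req, entry and dst_buf from the surrounding zswap code, and the new zs_obj_read_sg_*() helpers are only referenced, not shown, since their exact signatures are not quoted here:

  struct scatterlist input[2];    /* a zsmalloc object may span two pages */
  struct scatterlist output;
  int ret;

  sg_init_table(input, 2);
  /* ... input[] populated via the new zs_obj_read_sg_*() API (not shown) ... */

  sg_init_one(&output, dst_buf, PAGE_SIZE);
  acomp_request_set_params(req, input, &output, entry->length, PAGE_SIZE);
  ret = crypto_acomp_decompress(req);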
Link: https://lkml.kernel.org/r/20260121013615.2906368-1-yosry.ahmed@linux.dev
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Merge tag 'mm-hotfixes-stable-2026-02-06-12-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"A couple of late-breaking MM fixes. One against a new-in-this-cycle
patch and the other addresses a locking issue which has been there for
over a year"
* tag 'mm-hotfixes-stable-2026-02-06-12-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/memory-failure: reject unsupported non-folio compound page
procfs: avoid fetching build ID while holding VMA lock
|
|
Merge tag 'slab-for-6.19-rc8-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fix from Vlastimil Babka:
"A stable fix for memory allocation profiling tag not being cleared
when aborting an allocation due to memcg charge failure (Hao Ge)"
* tag 'slab-for-6.19-rc8-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm/slab: Add alloc_tagging_slab_free_hook for memcg_alloc_abort_single
|
|
'core' into next
|
|
SLAB_NO_OBJ_EXT is set for boot caches, but need_slab_obj_exts() doesn't
check this flag. We should return false unconditionally when
SLAB_NO_OBJ_EXT is set.
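That is, roughly (sketch, not the exact hunk; whether the cache is passed in as shown here is an assumption):

  static inline bool need_slab_obj_exts(struct kmem_cache *s)
  {
          /* Boot caches opt out of obj_ext metadata entirely. */
          if (s->flags & SLAB_NO_OBJ_EXT)
                  return false;

          /* ... the existing memcg / alloc_tag checks follow ... */
          return true;
  }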
Signed-off-by: Hao Li <hao.li@linux.dev>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260205120709.425719-1-hao.li@linux.dev
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
When CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled, the following warning
may be noticed:
[ 3959.023862] ------------[ cut here ]------------
[ 3959.023891] alloc_tag was not cleared (got tag for lib/xarray.c:378)
[ 3959.023947] WARNING: ./include/linux/alloc_tag.h:155 at alloc_tag_add+0x128/0x178, CPU#6: mkfs.ntfs/113998
[ 3959.023978] Modules linked in: dns_resolver tun brd overlay exfat btrfs blake2b libblake2b xor xor_neon raid6_pq loop sctp ip6_udp_tunnel udp_tunnel ext4 crc16 mbcache jbd2 rfkill sunrpc vfat fat sg fuse nfnetlink sr_mod virtio_gpu cdrom drm_client_lib virtio_dma_buf drm_shmem_helper drm_kms_helper ghash_ce drm sm4 backlight virtio_net net_failover virtio_scsi failover virtio_console virtio_blk virtio_mmio dm_mirror dm_region_hash dm_log dm_multipath dm_mod i2c_dev aes_neon_bs aes_ce_blk [last unloaded: hwpoison_inject]
[ 3959.024170] CPU: 6 UID: 0 PID: 113998 Comm: mkfs.ntfs Kdump: loaded Tainted: G W 6.19.0-rc7+ #7 PREEMPT(voluntary)
[ 3959.024182] Tainted: [W]=WARN
[ 3959.024186] Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
[ 3959.024192] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 3959.024199] pc : alloc_tag_add+0x128/0x178
[ 3959.024207] lr : alloc_tag_add+0x128/0x178
[ 3959.024214] sp : ffff80008b696d60
[ 3959.024219] x29: ffff80008b696d60 x28: 0000000000000000 x27: 0000000000000240
[ 3959.024232] x26: 0000000000000000 x25: 0000000000000240 x24: ffff800085d17860
[ 3959.024245] x23: 0000000000402800 x22: ffff0000c0012dc0 x21: 00000000000002d0
[ 3959.024257] x20: ffff0000e6ef3318 x19: ffff800085ae0410 x18: 0000000000000000
[ 3959.024269] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 3959.024281] x14: 0000000000000000 x13: 0000000000000001 x12: ffff600064101293
[ 3959.024292] x11: 1fffe00064101292 x10: ffff600064101292 x9 : dfff800000000000
[ 3959.024305] x8 : 00009fff9befed6e x7 : ffff000320809493 x6 : 0000000000000001
[ 3959.024316] x5 : ffff000320809490 x4 : ffff600064101293 x3 : ffff800080691838
[ 3959.024328] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000d5bcd640
[ 3959.024340] Call trace:
[ 3959.024346] alloc_tag_add+0x128/0x178 (P)
[ 3959.024355] __alloc_tagging_slab_alloc_hook+0x11c/0x1a8
[ 3959.024362] kmem_cache_alloc_lru_noprof+0x1b8/0x5e8
[ 3959.024369] xas_alloc+0x304/0x4f0
[ 3959.024381] xas_create+0x1e0/0x4a0
[ 3959.024388] xas_store+0x68/0xda8
[ 3959.024395] __filemap_add_folio+0x5b0/0xbd8
[ 3959.024409] filemap_add_folio+0x16c/0x7e0
[ 3959.024416] __filemap_get_folio_mpol+0x2dc/0x9e8
[ 3959.024424] iomap_get_folio+0xfc/0x180
[ 3959.024435] __iomap_get_folio+0x2f8/0x4b8
[ 3959.024441] iomap_write_begin+0x198/0xc18
[ 3959.024448] iomap_write_iter+0x2ec/0x8f8
[ 3959.024454] iomap_file_buffered_write+0x19c/0x290
[ 3959.024461] blkdev_write_iter+0x38c/0x978
[ 3959.024470] vfs_write+0x4d4/0x928
[ 3959.024482] ksys_write+0xfc/0x1f8
[ 3959.024489] __arm64_sys_write+0x74/0xb0
[ 3959.024496] invoke_syscall+0xd4/0x258
[ 3959.024507] el0_svc_common.constprop.0+0xb4/0x240
[ 3959.024514] do_el0_svc+0x48/0x68
[ 3959.024520] el0_svc+0x40/0xf8
[ 3959.024526] el0t_64_sync_handler+0xa0/0xe8
[ 3959.024533] el0t_64_sync+0x1ac/0x1b0
[ 3959.024540] ---[ end trace 0000000000000000 ]---
When __memcg_slab_post_alloc_hook() fails, there are two different
free paths depending on whether size == 1 or size != 1. In the
kmem_cache_free_bulk() path, we do call alloc_tagging_slab_free_hook().
However, in memcg_alloc_abort_single() we don't, so the above warning will
be triggered on the next allocation.
Therefore, add alloc_tagging_slab_free_hook() to the
memcg_alloc_abort_single() path.
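Schematically (simplified; the hook's exact arguments and the surrounding free logic are elided and may differ from the real mm/slub.c code):

  static noinline void memcg_alloc_abort_single(struct kmem_cache *s,
                                                void *object)
  {
          /*
           * Clear the allocation tag before giving the object back,
           * mirroring what the kmem_cache_free_bulk() abort path does.
           */
          alloc_tagging_slab_free_hook(s, virt_to_slab(object), &object, 1);

          /* ... existing single-object free follows ... */
  }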
Fixes: 9f9796b413d3 ("mm, slab: move memcg charging to post-alloc hook")
Cc: stable@vger.kernel.org
Suggested-by: Hao Li <hao.li@linux.dev>
Signed-off-by: Hao Ge <hao.ge@linux.dev>
Reviewed-by: Hao Li <hao.li@linux.dev>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260204101401.202762-1-hao.ge@linux.dev
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
When !CONFIG_TRANSPARENT_HUGEPAGE, a non-folio compound page can appear
in a userspace mapping via either the vm_insert_*() functions or
vm_operations_struct->fault(). Such pages are not folios and thus must
not be considered for folio operations like split. To reject them, make
sure get_hwpoison_page() is always called, as HWPoisonHandlable() does
the right checks.
[Some commit log borrowed from Zi Yan. Thanks.]
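Conceptually (an illustrative sketch, not the exact mm/memory-failure.c
flow), the handling reduces to always taking the reference through
get_hwpoison_page() and bailing out when the page is not handlable:

    /*
     * get_hwpoison_page() relies on HWPoisonHandlable(), so a non-folio
     * compound page is rejected here before any folio split is attempted.
     * Return-value handling is simplified for illustration.
     */
    if (get_hwpoison_page(p, flags) != 1)
            return -EBUSY;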
Link: https://lkml.kernel.org/r/20260205075328.523211-1-linmiaohe@huawei.com
Fixes: 689b8986776c ("mm/memory-failure: improve large block size folio handling")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reported-by: 是参差 <shicenci@gmail.com>
Closes: https://lore.kernel.org/all/PS1PPF7E1D7501F1E4F4441E7ECD056DEADAB98A@PS1PPF7E1D7501F.apcprd02.prod.outlook.com/
Reviewed-by: Zi Yan <ziy@nvidia.com>
Tested-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"Five hotfixes. Two are cc:stable, two are for MM.
All are singletons - please see the changelogs for details"
* tag 'mm-hotfixes-stable-2026-02-04-15-55' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
Documentation: document liveupdate cmdline parameter
mm, shmem: prevent infinite loop on truncate race
mailmap: update Alexander Mikhalitsyn's emails
liveupdate: luo_file: do not clear serialized_data on unfreeze
x86/kfence: fix booting on 32bit non-PAE systems
|
|
While SLAB_OBJ_EXT_IN_OBJ reduces the memory overhead of accounting slab
objects, it prevents slab merging because merging can change the
metadata layout.
As pointed out by Vlastimil Babka, disabling merging solely for this
memory optimization may not be a net win, because disabling slab merging
tends to increase overall memory usage.
Restrict SLAB_OBJ_EXT_IN_OBJ to caches that are already unmergeable for
other reasons (e.g., those with constructors or SLAB_TYPESAFE_BY_RCU).
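A minimal sketch of the restriction; the helper name and the exact set of
"already unmergeable" conditions are illustrative:

    /* Hypothetical predicate: embed obj_ext metadata inside objects only
     * for caches that cannot be merged anyway. */
    static bool may_use_obj_ext_in_obj(struct kmem_cache *s)
    {
            return s->ctor || (s->flags & SLAB_TYPESAFE_BY_RCU);
    }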
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260127103151.21883-3-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
When a cache has a high s->align value and s->object_size is not aligned
to it, each object ends up with some unused space because of alignment.
If this wasted space is big enough, we can use it to store the
slabobj_ext metadata instead of wasting it.
On my system, this happens with caches like kmem_cache, mm_struct, pid,
task_struct, sighand_cache, xfs_inode, and others.
To place the slabobj_ext metadata within each object, the existing
slab_obj_ext() logic can still be used by setting:
- slab->obj_exts = slab_address(slab) + (slabobj_ext offset)
- stride = s->size
slab_obj_ext() doesn't need to know where the metadata is stored,
so this method works without adding extra overhead to slab_obj_ext().
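As a rough illustration (not the exact kernel code), the setup amounts to:

    /*
     * Illustrative only: point obj_exts at the first embedded slot and use
     * a per-object stride, so the generic slab_obj_ext() addressing keeps
     * working unchanged.  slab_set_stride() is a hypothetical helper; flag
     * bits stored in slab->obj_exts are ignored here.
     */
    static void setup_obj_exts_in_obj(struct kmem_cache *s, struct slab *slab,
                                      unsigned int obj_ext_offset)
    {
            slab->obj_exts = (unsigned long)slab_address(slab) + obj_ext_offset;
            slab_set_stride(slab, s->size);
    }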
A good example benefiting from this optimization is xfs_inode
(object_size: 992, align: 64). To measure memory savings, about 2 million
files were created on XFS.
[ MEMCG=y, MEM_ALLOC_PROFILING=n ]
Before patch (creating ~2.64M directories on xfs):
Slab: 5175976 kB
SReclaimable: 3837524 kB
SUnreclaim: 1338452 kB
After patch (creating ~2.64M directories on xfs):
Slab: 5152912 kB
SReclaimable: 3838568 kB
SUnreclaim: 1314344 kB (-23.54 MiB)
Enjoy the memory savings!
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260113061845.159790-10-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
So that ksize() can access SLUB's internal implementation details beyond
cache flags, move __ksize(), ksize(), and slab_ksize() to mm/slub.c.
[vbabka@suse.cz: also make __ksize() static and move its kerneldoc to
ksize() ]
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260113061845.159790-9-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
The leftover space in a slab is always smaller than s->size, and
kmem caches for large objects that are not power-of-two sizes tend to have
a greater amount of leftover space per slab. In some cases, the leftover
space is larger than the size of the slabobj_ext array for the slab.
An excellent example of such a cache is ext4_inode_cache. On my system,
the object size is 1136, with a preferred order of 3, 28 objects per slab,
and 960 bytes of leftover space per slab.
Since the size of the slabobj_ext array is only 224 bytes (w/o mem
profiling) or 448 bytes (w/ mem profiling) per slab, the entire array
fits within the leftover space.
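A hypothetical helper makes the fit check concrete; plugging in the
ext4_inode_cache numbers above: leftover = 8 * 4096 - 28 * 1136 = 960
bytes, while the array needs 28 * 8 = 224 bytes (or 28 * 16 = 448 bytes
with memory profiling), so it fits:

    /* Does the leftover space at the end of a slab fit the whole
     * slabobj_ext array?  (Names and parameters are illustrative.) */
    static inline bool obj_exts_fit_in_leftover(size_t slab_bytes,
                                                unsigned int objects,
                                                size_t obj_size,
                                                size_t ext_size)
    {
            return slab_bytes - (size_t)objects * obj_size >=
                   (size_t)objects * ext_size;
    }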
Allocate the slabobj_exts array from this unused space instead of using
kcalloc() when it is large enough. The array is allocated from unused
space only when creating new slabs; when alloc_slab_obj_exts() is called
after slab creation, the unused space is not used, because implementing
such lazy allocation would require more expensive synchronization.
The implementation and evaluation of lazy allocation from unused space
are left as future work. As pointed out by Vlastimil Babka [1], it could
be beneficial when a slab cache is created without SLAB_ACCOUNT but some
of the allocations from that cache use __GFP_ACCOUNT; xarray does that,
for example.
To avoid unnecessary overhead when MEMCG (with SLAB_ACCOUNT) and
MEM_ALLOC_PROFILING are not used for the cache, allocate the slabobj_ext
array only when either of them is enabled on slab allocation.
[ MEMCG=y, MEM_ALLOC_PROFILING=n ]
Before patch (creating ~2.64M directories on ext4):
Slab: 4747880 kB
SReclaimable: 4169652 kB
SUnreclaim: 578228 kB
After patch (creating ~2.64M directories on ext4):
Slab: 4724020 kB
SReclaimable: 4169188 kB
SUnreclaim: 554832 kB (-22.84 MiB)
Enjoy the memory savings!
Link: https://lore.kernel.org/linux-mm/48029aab-20ea-4d90-bfd1-255592b2018e@suse.cz [1]
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260113061845.159790-8-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
In the near future, slabobj_ext may reside outside the allocated slab
object range within a slab, which could be reported as an out-of-bounds
access by KASAN.
As suggested by Andrey Konovalov [1], explicitly disable KASAN and KMSAN
checks when accessing slabobj_ext within the slab allocator, memory
profiling, and memory cgroup code. An alternative approach would be to
unpoison slabobj_ext, but that would also hide genuine out-of-bounds
accesses from outside the slab allocator, which are generally more
common.
Move the metadata_access_enable()/disable() helpers to mm/slab.h so that
they can be used outside mm/slub.c. However, as suggested by Suren
Baghdasaryan [2], instead of calling them directly from mm code (which is
more error-prone), change users to access slabobj_ext via get/put APIs:
- Users should call get_slab_obj_exts() to access slabobj_metadata
and call put_slab_obj_exts() when it's done.
- From now on, accessing it outside the section covered by
get_slab_obj_exts() ~ put_slab_obj_exts() is illegal.
This ensures that accesses to slabobj_ext metadata won't be reported
as access violations.
Call kasan_reset_tag() in slab_obj_ext() before returning the address to
prevent SW or HW tag-based KASAN from reporting false positives.
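The intended usage pattern looks roughly like this (an illustrative
sketch; the exact return types of the get/put helpers may differ):

    unsigned long obj_exts;

    obj_exts = get_slab_obj_exts(slab);     /* metadata_access_enable() inside */
    if (obj_exts) {
            struct slabobj_ext *ext = slab_obj_ext(slab, obj_exts, index);

            /* ... read or update the metadata here, nowhere else ... */
    }
    put_slab_obj_exts(slab);                /* metadata_access_disable() inside */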
Suggested-by: Andrey Konovalov <andreyknvl@gmail.com>
Suggested-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/linux-mm/CA+fCnZezoWn40BaS3cgmCeLwjT+5AndzcQLc=wH3BjMCu6_YCw@mail.gmail.com [1]
Link: https://lore.kernel.org/linux-mm/CAJuCfpG=Lb4WhYuPkSpdNO4Ehtjm1YcEEK0OM=3g9i=LxmpHSQ@mail.gmail.com [2]
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260113061845.159790-7-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Use a configurable stride value when accessing slab object extension
metadata instead of assuming a fixed sizeof(struct slabobj_ext).
Store the stride value in free bits of the slab->counters field. This
allows for flexibility in cases where the extension is embedded within
slab objects.
Since these free bits exist only on 64-bit, any future optimization that
needs to change the stride value cannot be enabled on 32-bit
architectures.
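A rough sketch of the packing, with a purely illustrative bit layout (the
real slab->counters layout is defined in mm/slab.h and differs):

    #define OBJ_EXT_STRIDE_SHIFT    48      /* hypothetical position */
    #define OBJ_EXT_STRIDE_MASK     (0xffffUL << OBJ_EXT_STRIDE_SHIFT)

    static inline void slab_set_stride(struct slab *slab, unsigned int stride)
    {
            slab->counters &= ~OBJ_EXT_STRIDE_MASK;
            slab->counters |= (unsigned long)stride << OBJ_EXT_STRIDE_SHIFT;
    }

    static inline unsigned int slab_get_stride(const struct slab *slab)
    {
            return (slab->counters & OBJ_EXT_STRIDE_MASK) >> OBJ_EXT_STRIDE_SHIFT;
    }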
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260113061845.159790-6-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Currently, the slab allocator assumes that slab->obj_exts is a pointer
to an array of struct slabobj_ext objects. However, to support storage
methods where struct slabobj_ext is embedded within objects, the slab
allocator should not make this assumption. Instead of directly
dereferencing the slabobj_exts array, abstract access to
struct slabobj_ext via helper functions.
Introduce a new API for slabobj_ext metadata access:
slab_obj_ext(slab, obj_exts, index) - returns the pointer to
struct slabobj_ext element at the given index.
Directly dereferencing the return value of slab_obj_exts() is no longer
allowed. Instead, slab_obj_ext() must always be used to access
individual struct slabobj_ext objects.
Convert all users to use these APIs.
No functional changes intended.
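Roughly, the accessor boils down to simple array indexing (an
illustrative sketch; parameter types and tag handling in the real helper
may differ):

    static inline struct slabobj_ext *slab_obj_ext(struct slab *slab,
                                                   unsigned long obj_exts,
                                                   unsigned int index)
    {
            /*
             * obj_exts is the metadata base address with any flag bits
             * already masked off; at this point the stride is still a
             * fixed sizeof(struct slabobj_ext).
             */
            return (struct slabobj_ext *)obj_exts + index;
    }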
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260113061845.159790-5-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
When a slab cache has a constructor, the free pointer is placed after the
object because certain fields must not be overwritten even after the
object is freed.
However, some fields that the constructor does not initialize can safely
be overwritten after free. Allow specifying the free pointer offset within
the object, reducing the overall object size when some fields can be reused
for the free pointer.
Adjust the documentation accordingly.
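A hedged usage sketch, assuming this builds on the existing
freeptr_offset/use_freeptr_offset fields of struct kmem_cache_args
(struct foo and its fields are hypothetical):

    struct foo {
            struct list_head list;  /* not touched by the constructor */
            spinlock_t lock;        /* initialized by the constructor */
    };

    static void foo_ctor(void *obj)
    {
            struct foo *f = obj;

            spin_lock_init(&f->lock);
    }

    static struct kmem_cache *foo_cache;

    static int __init foo_cache_init(void)
    {
            struct kmem_cache_args args = {
                    .ctor = foo_ctor,
                    .use_freeptr_offset = true,
                    /* reuse a field the constructor leaves alone */
                    .freeptr_offset = offsetof(struct foo, list),
            };

            foo_cache = kmem_cache_create("foo", sizeof(struct foo), &args, 0);
            return foo_cache ? 0 : -ENOMEM;
    }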
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260113061845.159790-3-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
When both KASAN and SLAB_STORE_USER are enabled, accesses to
struct kasan_alloc_meta fields can be misaligned on 64-bit architectures.
This occurs because orig_size is currently defined as unsigned int,
which only guarantees 4-byte alignment. When struct kasan_alloc_meta is
placed after orig_size, it may end up at a 4-byte boundary rather than
the required 8-byte boundary on 64-bit systems.
Note that 64-bit architectures without HAVE_EFFICIENT_UNALIGNED_ACCESS
are assumed to require 64-bit accesses to be 64-bit aligned.
See HAVE_64BIT_ALIGNED_ACCESS and commit adab66b71abf ("Revert:
"ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS"") for more details.
Change orig_size from unsigned int to unsigned long to ensure proper
alignment for any subsequent metadata. This should not waste additional
memory because kmalloc objects are already aligned to at least
ARCH_KMALLOC_MINALIGN.
Suggested-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: stable@vger.kernel.org
Fixes: 6edf2576a6cc ("mm/slub: enable debugging memory wasting of kmalloc")
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Closes: https://lore.kernel.org/all/aPrLF0OUK651M4dk@hyeyoo/
Link: https://patch.msgid.link/20260113061845.159790-2-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
The comments above check_pad_bytes() document the field layout of a
single object. Rewrite them to improve clarity and precision.
Also update an outdated comment in calculate_sizes().
Suggested-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Hao Li <hao.li@linux.dev>
Link: https://patch.msgid.link/20251229122415.192377-1-hao.li@linux.dev
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
When allocating the slabobj_ext array in alloc_slab_obj_exts(), the
array can end up being allocated from the same slab it describes.
This led to obj_exts_in_slab() incorrectly returning true [1], even
though the array is not allocated from the slab's wasted space.
Vlastimil Babka observed that this problem should be fixed even when
ignoring its incompatibility with obj_exts_in_slab(), because it creates
slabs that can never be freed, as there is always at least one allocated
object (the array itself).
To avoid this, use the next kmalloc size or a large kmalloc when the
array would otherwise be allocated from the same cache it is being
allocated for.
In the case of random kmalloc caches, there are multiple kmalloc caches
for the same size and the cache is selected based on the caller address.
Because it is fragile to ensure the same caller address is passed to
kmalloc_slab(), kmalloc_noprof(), and kmalloc_node_noprof(), bump the
size to (s->object_size + 1) when the sizes are equal, instead of
directly comparing the kmem_cache pointers.
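A rough sketch of the bump; the exact size comparison and the
large-kmalloc handling in the real patch may differ:

    size_t sz = objects * sizeof(struct slabobj_ext);

    /*
     * If the array would land in a kmalloc size class equal to the size
     * of the cache it describes, request one byte more so the next
     * kmalloc size (or a large kmalloc) is used instead.  Comparing sizes
     * rather than kmem_cache pointers also works with random kmalloc
     * caches.
     */
    if (kmalloc_size_roundup(sz) == s->object_size)
            sz = s->object_size + 1;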
Note that this doesn't happen when memory allocation profiling is
disabled: when the allocation of the array is triggered by the memory
cgroup code (KMALLOC_CGROUP), the array is allocated from KMALLOC_NORMAL.
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202601231457.f7b31e09-lkp@intel.com [1]
Cc: stable@vger.kernel.org
Fixes: 4b8736964640 ("mm/slab: add allocation accounting into slab allocation and free paths")
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260126125714.88008-1-harry.yoo@oracle.com
Reviewed-by: Hao Li <hao.li@linux.dev>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime.
In order to synchronize against the vmstat workqueue and make sure that
no asynchronous vmstat work is still pending or executing on a newly
isolated CPU, the housekeeping subsystem must flush the vmstat
workqueues.
This involves flushing the whole mm_percpu_wq workqueue, which is shared
with the LRU drain, a welcome side effect.
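Conceptually (an illustrative sketch, not the exact housekeeping code),
the synchronization point is a plain workqueue flush:

    /* After HK_TYPE_DOMAIN shrinks, wait for any vmstat or LRU-drain
     * work already queued on the newly isolated CPUs to finish. */
    flush_workqueue(mm_percpu_wq);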
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: linux-mm@kvack.org
|
|
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime. In
order to synchronize against the memcg workqueue and make sure that no
asynchronous draining is still pending or executing on a newly isolated
CPU, the housekeeping subsystem must flush the memcg workqueues.
However, the memcg work items can't be flushed easily since they are
queued on the main per-CPU workqueue pool.
Solve this by creating a memcg-specific workqueue and by providing and
using the appropriate flushing API.
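An illustrative shape of the change; the workqueue name, flags, and flush
helper are hypothetical:

    /* Dedicated workqueue so memcg drain work can be flushed without
     * flushing the global per-CPU pool. */
    static struct workqueue_struct *memcg_wq;

    void memcg_flush_workqueue(void)
    {
            flush_workqueue(memcg_wq);
    }

    static int __init memcg_wq_init(void)
    {
            memcg_wq = alloc_workqueue("memcg", WQ_MEM_RECLAIM, 0);
            return memcg_wq ? 0 : -ENOMEM;
    }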
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
|
|
The HK_TYPE_DOMAIN housekeeping cpumask will soon be made modifiable at
runtime. In order to synchronize against the vmstat workqueue and make
sure that no asynchronous vmstat work is pending or executing on a newly
isolated CPU, target and queue vmstat work under the same RCU read-side
critical section.
Whenever housekeeping updates the HK_TYPE_DOMAIN cpumask, a vmstat
workqueue flush will also be issued in a further change, to make sure
that no work remains pending after a CPU has been made isolated.
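A sketch of the pattern; the flush mentioned above is what makes it
airtight (names of the work item and delay are illustrative):

    /*
     * Checking the housekeeping mask and queueing the work in one RCU
     * read-side section means a concurrent cpumask update either sees the
     * work already queued (and the subsequent flush waits for it), or the
     * CPU is excluded before the work can be queued.
     */
    rcu_read_lock();
    if (housekeeping_cpu(cpu, HK_TYPE_DOMAIN))
            queue_delayed_work_on(cpu, mm_percpu_wq, dwork, delay);
    rcu_read_unlock();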
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: linux-mm@kvack.org
|
|
The HK_TYPE_DOMAIN housekeeping cpumask will soon be made modifiable at
runtime. In order to synchronize against the memcg workqueue and make
sure that no asynchronous draining is pending or executing on a newly
isolated CPU, target and queue the drain work under the same RCU
critical section.
Whenever housekeeping updates the HK_TYPE_DOMAIN cpumask, a memcg
workqueue flush will also be issued in a further change, to make sure
that no work remains pending after a CPU has been made isolated.
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
|