user/sven/linux.git/virt/kvm, branch v5.15.103

KVM: fix memoryleak in kvm_init()

2023-03-17T07:49:04Z

commit 5a2a961be2ad6a16eb388a80442443b353c11d16 upstream. When alloc_cpumask_var_node() fails for a certain cpu, there might be some allocated cpumasks for percpu cpu_kick_mask. We should free these cpumasks or memoryleak will occur. Fixes: baff59ccdc65 ("KVM: Pre-allocate cpumasks for kvm_make_all_cpus_request_except()") Signed-off-by: Miaohe Lin Link: https://lore.kernel.org/r/20220823063414.59778-1-linmiaohe@huawei.com Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini Signed-off-by: Greg Kroah-Hartman

KVM: Register /dev/kvm as the _very_ last thing during initialization

2023-03-17T07:48:49Z

[ Upstream commit 2b01281273738bf2d6551da48d65db2df3f28998 ] Register /dev/kvm, i.e. expose KVM to userspace, only after all other setup has completed. Once /dev/kvm is exposed, userspace can start invoking KVM ioctls, creating VMs, etc... If userspace creates a VM before KVM is done with its configuration, bad things may happen, e.g. KVM will fail to properly migrate vCPU state if a VM is created before KVM has registered preemption notifiers. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson Message-Id: <20221130230934.1014142-2-seanjc@google.com> Signed-off-by: Paolo Bonzini Signed-off-by: Sasha Levin

KVM: Pre-allocate cpumasks for kvm_make_all_cpus_request_except()

2023-03-17T07:48:49Z

[ Upstream commit baff59ccdc657d290be51b95b38ebe5de40036b4 ] Allocating cpumask dynamically in zalloc_cpumask_var() is not ideal. Allocation is somewhat slow and can (in theory and when CPUMASK_OFFSTACK) fail. kvm_make_all_cpus_request_except() already disables preemption so we can use pre-allocated per-cpu cpumasks instead. Signed-off-by: Vitaly Kuznetsov Reviewed-by: Sean Christopherson Signed-off-by: Paolo Bonzini Message-Id: <20210903075141.403071-8-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini Stable-dep-of: 2b0128127373 ("KVM: Register /dev/kvm as the _very_ last thing during initialization") Signed-off-by: Sasha Levin

KVM: Optimize kvm_make_vcpus_request_mask() a bit

2023-03-17T07:48:49Z

[ Upstream commit ae0946cd3601752dc58f86d84258e5361e9c8cd4 ] Iterating over set bits in 'vcpu_bitmap' should be faster than going through all vCPUs, especially when just a few bits are set. Drop kvm_make_vcpus_request_mask() call from kvm_make_all_cpus_request_except() to avoid handling the special case when 'vcpu_bitmap' is NULL, move the code to kvm_make_all_cpus_request_except() itself. Signed-off-by: Vitaly Kuznetsov Reviewed-by: Sean Christopherson Signed-off-by: Paolo Bonzini Message-Id: <20210903075141.403071-5-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini Stable-dep-of: 2b0128127373 ("KVM: Register /dev/kvm as the _very_ last thing during initialization") Signed-off-by: Sasha Levin

KVM: Destroy target device if coalesced MMIO unregistration fails

2023-03-10T08:40:00Z

commit b1cb1fac22abf102ffeb29dd3eeca208a3869d54 upstream. Destroy and free the target coalesced MMIO device if unregistering said device fails. As clearly noted in the code, kvm_io_bus_unregister_dev() does not destroy the target device. BUG: memory leak unreferenced object 0xffff888112a54880 (size 64): comm "syz-executor.2", pid 5258, jiffies 4297861402 (age 14.129s) hex dump (first 32 bytes): 38 c7 67 15 00 c9 ff ff 38 c7 67 15 00 c9 ff ff 8.g.....8.g..... e0 c7 e1 83 ff ff ff ff 00 30 67 15 00 c9 ff ff .........0g..... backtrace: [<0000000006995a8a>] kmalloc include/linux/slab.h:556 [inline] [<0000000006995a8a>] kzalloc include/linux/slab.h:690 [inline] [<0000000006995a8a>] kvm_vm_ioctl_register_coalesced_mmio+0x8e/0x3d0 arch/x86/kvm/../../../virt/kvm/coalesced_mmio.c:150 [<00000000022550c2>] kvm_vm_ioctl+0x47d/0x1600 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3323 [<000000008a75102f>] vfs_ioctl fs/ioctl.c:46 [inline] [<000000008a75102f>] file_ioctl fs/ioctl.c:509 [inline] [<000000008a75102f>] do_vfs_ioctl+0xbab/0x1160 fs/ioctl.c:696 [<0000000080e3f669>] ksys_ioctl+0x76/0xa0 fs/ioctl.c:713 [<0000000059ef4888>] __do_sys_ioctl fs/ioctl.c:720 [inline] [<0000000059ef4888>] __se_sys_ioctl fs/ioctl.c:718 [inline] [<0000000059ef4888>] __x64_sys_ioctl+0x6f/0xb0 fs/ioctl.c:718 [<000000006444fa05>] do_syscall_64+0x9f/0x4e0 arch/x86/entry/common.c:290 [<000000009a4ed50b>] entry_SYSCALL_64_after_hwframe+0x49/0xbe BUG: leak checking failed Fixes: 5d3c4c79384a ("KVM: Stop looking for coalesced MMIO zones if the bus is destroyed") Cc: stable@vger.kernel.org Reported-by: 柳菁峰 Reported-by: Michal Luczaj Link: https://lore.kernel.org/r/20221219171924.67989-1-seanjc@google.com Link: https://lore.kernel.org/all/20230118220003.1239032-1-mhal@rbox.co Signed-off-by: Sean Christopherson Signed-off-by: Greg Kroah-Hartman

kvm: Add support for arch compat vm ioctls

2022-10-29T08:12:54Z

commit ed51862f2f57cbce6fed2d4278cfe70a490899fd upstream. We will introduce the first architecture specific compat vm ioctl in the next patch. Add all necessary boilerplate to allow architectures to override compat vm ioctls when necessary. Signed-off-by: Alexander Graf Message-Id: <20221017184541.2658-2-graf@amazon.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini Signed-off-by: Greg Kroah-Hartman

KVM: SEV: add cache flush to solve SEV cache incoherency issues

2022-09-23T12:15:52Z

commit 683412ccf61294d727ead4a73d97397396e69a6b upstream. Flush the CPU caches when memory is reclaimed from an SEV guest (where reclaim also includes it being unmapped from KVM's memslots). Due to lack of coherency for SEV encrypted memory, failure to flush results in silent data corruption if userspace is malicious/broken and doesn't ensure SEV guest memory is properly pinned and unpinned. Cache coherency is not enforced across the VM boundary in SEV (AMD APM vol.2 Section 15.34.7). Confidential cachelines, generated by confidential VM guests have to be explicitly flushed on the host side. If a memory page containing dirty confidential cachelines was released by VM and reallocated to another user, the cachelines may corrupt the new user at a later time. KVM takes a shortcut by assuming all confidential memory remain pinned until the end of VM lifetime. Therefore, KVM does not flush cache at mmu_notifier invalidation events. Because of this incorrect assumption and the lack of cache flushing, malicous userspace can crash the host kernel: creating a malicious VM and continuously allocates/releases unpinned confidential memory pages when the VM is running. Add cache flush operations to mmu_notifier operations to ensure that any physical memory leaving the guest VM get flushed. In particular, hook mmu_notifier_invalidate_range_start and mmu_notifier_release events and flush cache accordingly. The hook after releasing the mmu lock to avoid contention with other vCPUs. Cc: stable@vger.kernel.org Suggested-by: Sean Christpherson Reported-by: Mingwei Zhang Signed-off-by: Mingwei Zhang Message-Id: <20220421031407.2516575-4-mizhang@google.com> Signed-off-by: Paolo Bonzini [OP: adjusted KVM_X86_OP_OPTIONAL() -> KVM_X86_OP_NULL, applied kvm_arch_guest_memory_reclaimed() call in kvm_set_memslot()] Signed-off-by: Ovidiu Panait Signed-off-by: Greg Kroah-Hartman

KVM: Unconditionally get a ref to /dev/kvm module when creating a VM

2022-08-25T09:39:53Z

commit 405294f29faee5de8c10cb9d4a90e229c2835279 upstream. Unconditionally get a reference to the /dev/kvm module when creating a VM instead of using try_get_module(), which will fail if the module is in the process of being forcefully unloaded. The error handling when try_get_module() fails doesn't properly unwind all that has been done, e.g. doesn't call kvm_arch_pre_destroy_vm() and doesn't remove the VM from the global list. Not removing VMs from the global list tends to be fatal, e.g. leads to use-after-free explosions. The obvious alternative would be to add proper unwinding, but the justification for using try_get_module(), "rmmod --wait", is completely bogus as support for "rmmod --wait", i.e. delete_module() without O_NONBLOCK, was removed by commit 3f2b9c9cdf38 ("module: remove rmmod --wait option.") nearly a decade ago. It's still possible for try_get_module() to fail due to the module dying (more like being killed), as the module will be tagged MODULE_STATE_GOING by "rmmod --force", i.e. delete_module(..., O_TRUNC), but playing nice with forced unloading is an exercise in futility and gives a falsea sense of security. Using try_get_module() only prevents acquiring _new_ references, it doesn't magically put the references held by other VMs, and forced unloading doesn't wait, i.e. "rmmod --force" on KVM is all but guaranteed to cause spectacular fireworks; the window where KVM will fail try_get_module() is tiny compared to the window where KVM is building and running the VM with an elevated module refcount. Addressing KVM's inability to play nice with "rmmod --force" is firmly out-of-scope. Forcefully unloading any module taints kernel (for obvious reasons) _and_ requires the kernel to be built with CONFIG_MODULE_FORCE_UNLOAD=y, which is off by default and comes with the amusing disclaimer that it's "mainly for kernel developers and desperate users". In other words, KVM is free to scoff at bug reports due to using "rmmod --force" while VMs may be running. Fixes: 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed") Cc: stable@vger.kernel.org Cc: David Matlack Signed-off-by: Sean Christopherson Message-Id: <20220816053937.2477106-3-seanjc@google.com> Signed-off-by: Paolo Bonzini Signed-off-by: Greg Kroah-Hartman

KVM: Don't set Accessed/Dirty bits for ZERO_PAGE

2022-08-17T12:23:44Z

[ Upstream commit a1040b0d42acf69bb4f6dbdc54c2dcd78eea1de5 ] Don't set Accessed/Dirty bits for a struct page with PG_reserved set, i.e. don't set A/D bits for the ZERO_PAGE. The ZERO_PAGE (or pages depending on the architecture) should obviously never be written, and similarly there's no point in marking it accessed as the page will never be swapped out or reclaimed. The comment in page-flags.h is quite clear that PG_reserved pages should be managed only by their owner, and strictly following that mandate also simplifies KVM's logic. Fixes: 7df003c85218 ("KVM: fix overflow of zero page refcount with ksm running") Signed-off-by: Sean Christopherson Message-Id: <20220429010416.2788472-4-seanjc@google.com> Signed-off-by: Paolo Bonzini Signed-off-by: Sasha Levin

KVM: Don't null dereference ops->destroy

2022-07-29T15:25:24Z

commit e8bc2427018826e02add7b0ed0fc625a60390ae5 upstream. A KVM device cleanup happens in either of two callbacks: 1) destroy() which is called when the VM is being destroyed; 2) release() which is called when a device fd is closed. Most KVM devices use 1) but Book3s's interrupt controller KVM devices (XICS, XIVE, XIVE-native) use 2) as they need to close and reopen during the machine execution. The error handling in kvm_ioctl_create_device() assumes destroy() is always defined which leads to NULL dereference as discovered by Syzkaller. This adds a checks for destroy!=NULL and adds a missing release(). This is not changing kvm_destroy_devices() as devices with defined release() should have been removed from the KVM devices list by then. Suggested-by: Paolo Bonzini Signed-off-by: Alexey Kardashevskiy Signed-off-by: Paolo Bonzini Signed-off-by: Greg Kroah-Hartman