summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu
AgeCommit message (Collapse)Author
32 hoursMerge tag 'mm-nonmm-stable-2026-02-12-10-48' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves disk space by teaching ocfs2 to reclaim suballocator block group space (Heming Zhao) - "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the ARRAY_END() macro and uses it in various places (Alejandro Colomar) - "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the page size (Pnina Feder) - "kallsyms: Prevent invalid access when showing module buildid" cleans up kallsyms code related to module buildid and fixes an invalid access crash when printing backtraces (Petr Mladek) - "Address page fault in ima_restore_measurement_list()" fixes a kexec-related crash that can occur when booting the second-stage kernel on x86 (Harshit Mogalapalli) - "kho: ABI headers and Documentation updates" updates the kexec handover ABI documentation (Mike Rapoport) - "Align atomic storage" adds the __aligned attribute to atomic_t and atomic64_t definitions to get natural alignment of both types on csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain) - "kho: clean up page initialization logic" simplifies the page initialization logic in kho_restore_page() (Pratyush Yadav) - "Unload linux/kernel.h" moves several things out of kernel.h and into more appropriate places (Yury Norov) - "don't abuse task_struct.group_leader" removes the usage of ->group_leader when it is "obviously unnecessary" (Oleg Nesterov) - "list private v2 & luo flb" adds some infrastructure improvements to the live update orchestrator (Pasha Tatashin) * tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits) watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency procfs: fix missing RCU protection when reading real_parent in do_task_stat() watchdog/softlockup: fix sample ring index wrap in need_counting_irqs() kcsan, compiler_types: avoid duplicate type issues in BPF Type Format kho: fix doc for kho_restore_pages() tests/liveupdate: add in-kernel liveupdate test liveupdate: luo_flb: introduce File-Lifecycle-Bound global state liveupdate: luo_file: Use private list list: add kunit test for private list primitives list: add primitives for private list manipulations delayacct: fix uapi timespec64 definition panic: add panic_force_cpu= parameter to redirect panic to a specific CPU netclassid: use thread_group_leader(p) in update_classid_task() RDMA/umem: don't abuse current->group_leader drm/pan*: don't abuse current->group_leader drm/amd: kill the outdated "Only the pthreads threading model is supported" checks drm/amdgpu: don't abuse current->group_leader android/binder: use same_thread_group(proc->tsk, current) in binder_mmap() android/binder: don't abuse current->group_leader kho: skip memoryless NUMA nodes when reserving scratch areas ...
2 daysMerge tag 'drm-next-2026-02-11' of https://gitlab.freedesktop.org/drm/kernelLinus Torvalds
Pull drm updates from Dave Airlie: "Highlights: - amdgpu support for lots of new IP blocks which means newer GPUs - xe has a lot of SR-IOV and SVM improvements - lots of intel display refactoring across i915/xe - msm has more support for gen8 platforms - Given up on kgdb/kms integration, it's too hard on modern hw core: - drop kgdb support - replace system workqueue with percpu - account for property blobs in memcg - MAINTAINERS updates for xe + buddy rust: - Fix documentation for Registration constructors - Use pin_init::zeroed() for fops initialization - Annotate DRM helpers with __rust_helper - Improve safety documentation for gem::Object::new() - Update AlwaysRefCounted imports - mm: Prevent integer overflow in page_align() atomic: - add drm_device pointer to drm_private_obj - introduce gamma/degamma LUT size check buddy: - fix free_trees memory leak - prevent BUG_ON bridge: - introduce drm_bridge_unplug/enter/exit - add connector argument to .hpd_notify - lots of recounting conversions - convert rockchip inno hdmi to bridge - lontium-lt9611uxc: switch to HDMI audio helpers - dw-hdmi-qp: add support for HPD-less setups - Algoltek AG6311 support panels: - edp: CSW MNE007QB3-1, AUO B140HAN06.4, AUO B140QAX01.H - st75751: add SPI support - Sitronix ST7920, Samsung LTL106HL02 - LG LH546WF1-ED01, HannStar HSD156J - BOE NV130WUM-T08 - Innolux G150XGE-L05 - Anbernic RG-DS dma-buf: - improve sg_table debugging - add tracepoints - call clear_page instead of memset - start to introduce cgroup memory accounting in heaps - remove sysfs stats dma-fence: - add new helpers dp: - mst: avoid oob access with vcpi=0 hdmi: - limit infoframes exposure to userspace gem: - reduce page table overhead with THP - fix leak in drm_gem_get_unmapped_area gpuvm: - API sanitation for rust bindings sched: - introduce new helpers panic: - report invalid panic modes - add kunit tests i915/xe display: - Expose sharpness only if num_scalers is >= 2 - Add initial Xe3P_LPD for NVL - BMG FBC support - Add MTL+ platforms to support dpll framework _ fix DIMM_S DRM decoding on ICL - Return to using AUX interrupts - PSR/Panel replay refactoring - use consolidation HDMI tables - Xe3_LPD CD2X dividier changes xe: - vfio: add vfio_pci for intel GPU - multi queue support - dynamic pagemaps and multi-device SVM - expose temp attribs in hwmon - NO_COMPRESSION bo flag - expose MERT OA unit - sysfs survivability refactor - SRIOV PF: add MERT support - enable SR-IOV VF migration - Enable I2C/NVM on Crescent Island - Xe3p page reclaimation support - introduce SRIOV scheduler groups - add SoC remappt support in system controller - insert compiler barriers in GuC code - define NVL GuC firmware - handle GT resume failure - fix drm scheduler layering violations - enable GSC loading and PXP for PTL - disable GuC Power DCC strategy on PTL - unregister drm device on probe error i915: - move to kernel standard fault injection - bump recommended GuC version for DG2 and MTL amdgpu: - SMUIO 15.x, PSP 15.x support - IH 6.1.1/7.1 support - MMHUB 3.4/4.2 support - GC 11.5.4/12.1 support - SDMA 6.1.4/7.1/7.11.4 support - JPEG 5.3 support - UserQ updates - GC 9 gfx queue reset support - TTM memory ops parallelization - convert legacy logging to new helpers - DC analog fixes amdkfd: - GC 11.5.4/12.1 suppport - SDMA 6.1.4/7.1 support - per context support - increase kfd process hash table - Reserved SDMA rework radeon: - convert legacy logging to new helpers - use devm for i2c adapters msm: - GPU - Document a612/RGMU dt bindings - UBWC 6.0 support (for A840 / Kaanapali) - a225 support - DPU: - Switch to use virtual planes by default - Fix DSI CMD panels on DPU 3.x - Rewrite format handling to remove intermediate representation - Fix watchdog on DPU 8.x+ - Fix TE / Vsync source setting on DPU 8.x+ - Add 3D_Mux on SC7280 - Kaanapali platform support - Fix UBWC register programming - Make RM reserve DSPP-enabled mixers for CRTCs with LMs - Gamma correction support - DP: - Enable support for eDP 1.4+ link rate tables - Fix MDSS1 DP indices on SA8775P, making them to work - Fix msm_dp_ctrl_config_msa() to work with LLVM 20 - DSI: - Document QCS8300 as compatible with SA8775P - Kaanapali platform support - DSI PHY: - switch to divider_determine_rate() - MDP5: - Drop support for MSM8998, SDM660 and SDM630 (switch over to DPU) - MDSS: - Kaanapali platform support - Fixed UBWC register programming nova-core: - Prepare for Turing support. This includes parsing and handling Turing-specific firmware headers and sections as well as a Turing Falcon HAL implementation - Get rid of the Result<impl PinInit<T, E>> anti-pattern - Relocate initializer-specific code into the appropriate initializer - Use CStr::from_bytes_until_nul() to remove custom helpers - Improve handling of unexpected firmware values - Clean up redundant debug prints - Replace c_str!() with native Rust C-string literals - Update nova-core task list nova: - Align GEM object size to system page size tyr: - Use generated uAPI bindings for GpuInfo - Replace manual sleeps with read_poll_timeout() - Replace c_str!() with native Rust C-string literals - Suppress warnings for unread fields - Fix incorrect register name in print statement nouveau: - fix big page table support races in PTE management - improve reclocking on tegra 186+ amdxdna: - fix suspend race conditions - improve handling of zero tail pointers - fix cu_idx overwritten during command setup - enable hardware context priority - remove NPU2 support - update message buffer allocation requirements - update firmware version check ast: - support imported cursor buffers - big endian fixes etnaviv: - add PPU flop reset support imagination: - add AM62P support - introduce hw version checks ivpu: - implement warm boot flow panfrost: - add bo sync ioctl - add GPU_PM_RT support for RZ/G3E SoC panthor: - add bo sync ioctl - enable timestamp propagation - scheduler robustness improvements - VM termination fixes - huge page support rockchip: - RK3368 HDMI Support - get rid of atomic_check fixups - RK3506 support - RK3576/RK3588 improved HPD handling rz-du: - RZ/V2H(P) MIPI-DSI Support v3d: - fix DMA segment size - convert to new logging helpers mediatek: - move DP training to hotplug thread - convert logging to new helpers - add support for HS speed DSI - Genio 510/700/1200-EVK, Radxa NIO-12L HDMI support atmel-hlcdc: - switch to drmm resource - support nomodeset - use newer helpers hisilicon: - fix various DP bugs renesas: - fix kernel panic on reboot exynos: - fix vidi_connection_ioctl using wrong device - fix vidi_connection deref user ptr - fix concurrency regression with vidi_context vkms: - add configfs support for display configuration * tag 'drm-next-2026-02-11' of https://gitlab.freedesktop.org/drm/kernel: (1610 commits) drm/xe/pm: Disable D3Cold for BMG only on specific platforms drm/xe: Fix kerneldoc for xe_tlb_inval_job_alloc_dep drm/xe: Fix kerneldoc for xe_gt_tlb_inval_init_early drm/xe: Fix kerneldoc for xe_migrate_exec_queue drm/xe/query: Fix topology query pointer advance drm/xe/guc: Fix kernel-doc warning in GuC scheduler ABI header drm/xe/guc: Fix CFI violation in debugfs access. accel/amdxdna: Move RPM resume into job run function accel/amdxdna: Fix incorrect DPM level after suspend/resume nouveau/vmm: start tracking if the LPT PTE is valid. (v6) nouveau/vmm: increase size of vmm pte tracker struct to u32 (v2) nouveau/vmm: rewrite pte tracker using a struct and bitfields. accel/amdxdna: Fix incorrect error code returned for failed chain command accel/amdxdna: Remove hardware context status drm/bridge: imx8qxp-pixel-combiner: Fix bailout for imx8qxp_pc_bridge_probe() drm/panel: ilitek-ili9882t: Remove duplicate initializers in tianma_il79900a_dsc drm/i915/display: fix the pixel normalization handling for xe3p_lpd drm/exynos: vidi: use ctx->lock to protect struct vidi_context member variables related to memory alloc/free drm/exynos: vidi: fix to avoid directly dereferencing user pointer drm/exynos: vidi: use priv->vidi_dev for ctx lookup in vidi_connection_ioctl() ...
10 daysdrm/amdgpu: Fix double deletion of validate_listHarish Kasiviswanathan
If amdgpu_amdkfd_gpuvm_free_memory_of_gpu() fails after kgd_mem is removed from validate_list, the mem handle still lingers in the KFD idr. This means when process is terminated, kfd_process_free_outstanding_kfd_bos() will call amdgpu_amdkfd_gpuvm_free_memory_of_gpu() again resulting in double deletion. To avoid this - (a) Check if list is empty before deleting it (b) Rearragne amdgpu_amdkfd_gpuvm_free_memory_of_gpu() such that it can be safely called again if it returns failure the first time. Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 6ba60345f45eaf7cb4f89105d26083a4b9fd1cba)
10 daysRevert "drm/amd: Check if ASPM is enabled from PCIe subsystem"Bert Karwatzki
This reverts commit 7294863a6f01248d72b61d38478978d638641bee. This commit was erroneously applied again after commit 0ab5d711ec74 ("drm/amd: Refactor `amdgpu_aspm` to be evaluated per device") removed it, leading to very hard to debug crashes, when used with a system with two AMD GPUs of which only one supports ASPM. Link: https://lore.kernel.org/linux-acpi/20251006120944.7880-1-spasswolf@web.de/ Link: https://github.com/acpica/acpica/issues/1060 Fixes: 0ab5d711ec74 ("drm/amd: Refactor `amdgpu_aspm` to be evaluated per device") Signed-off-by: Bert Karwatzki <spasswolf@web.de> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 97a9689300eb2b393ba5efc17c8e5db835917080) Cc: stable@vger.kernel.org
10 daysdrm/amd: Set minimum version for set_hw_resource_1 on gfx11 to 0x52Mario Limonciello
commit f81cd793119e ("drm/amd/amdgpu: Fix MES init sequence") caused a dependency on new enough MES firmware to use amdgpu. This was fixed on most gfx11 and gfx12 hardware with commit 0180e0a5dd5c ("drm/amdgpu/mes: add compatibility checks for set_hw_resource_1"), but this left out that GC 11.0.4 had breakage at MES 0x51. Bump the requirement to 0x52 instead. Reported-by: danijel@nausys.com Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4576 Fixes: f81cd793119e ("drm/amd/amdgpu: Fix MES init sequence") Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit c2d2ccc85faf8cc6934d50c18e43097eb453ade2) Cc: stable@vger.kernel.org
11 daysdrm/amd: kill the outdated "Only the pthreads threading model is supported" ↵Oleg Nesterov
checks Nowadays task->group_leader->mm != task->mm is only possible if a) task is not a group leader and b) task->group_leader->mm == NULL because task->group_leader has already exited using sys_exit(). I don't think that drm/amd tries to detect/nack this case. Link: https://lkml.kernel.org/r/aXY_yLVHd63UlWtm@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Christan König <christian.koenig@amd.com> Acked-by: Felix Kuehling <felix.kuehling@amd.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Boris Brezillon <boris.brezillon@collabora.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Leon Romanovsky <leon@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: Steven Price <steven.price@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 daysdrm/amdgpu: don't abuse current->group_leaderOleg Nesterov
Cleanup and preparation to simplify the next changes. - Use current->tgid instead of current->group_leader->pid - Use get_task_pid(current, PIDTYPE_TGID) instead of get_task_pid(current->group_leader, PIDTYPE_PID) Link: https://lkml.kernel.org/r/aXY_wKewzV5lCa5I@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Felix Kuehling <felix.kuehling@amd.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Boris Brezillon <boris.brezillon@collabora.com> Cc: Christan König <christian.koenig@amd.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Leon Romanovsky <leon@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: Steven Price <steven.price@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 daysMerge tag 'amd-drm-next-6.20-2026-01-30' of ↵Dave Airlie
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.20-2026-01-30: amdgpu: - Misc cleanups - SMU 13 fixes - SMU 14 fixes - GPUVM fault filter fix - USB4 fixes - DC FP guard fixes - Powergating fix - JPEG ring reset fix - RAS fixes - Xclk fix for soc21 APUs - Fix COND_EXEC handling for GC 11 - UserQ fixes - MQD size alignment fixes - SMU feature interface cleanup - GC 10-12 KGQ init fixes - GC 11-12 KGQ reset fixes amdkfd: - Fix device snapshot reporting - GC 12.1 trap handler fixes - MQD size alignment fixes Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patch.msgid.link/20260130183257.28879-1-alexander.deucher@amd.com
2026-01-29drm/amdgpu/gfx12: adjust KGQ reset sequenceAlex Deucher
Kernel gfx queues do not need to be reinitialized or remapped after a reset. Align with gfx11. v2: preserve init and remap for MMIO case. Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 0a6d6ed694d72b66b0ed7a483d5effa01acd3951) Cc: stable@vger.kernel.org
2026-01-29drm/amdgpu/gfx11: adjust KGQ reset sequenceAlex Deucher
Kernel gfx queues do not need to be reinitialized or remapped after a reset. This fixes queue reset failures on APUs. v2: preserve init and remap for MMIO case. Fixes: b3e9bfd86658 ("drm/amdgpu/gfx11: add ring reset callbacks") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4789 Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit b340ff216fdabfe71ba0cdd47e9835a141d08e10) Cc: stable@vger.kernel.org
2026-01-29drm/amdgpu/gfx12: fix wptr reset in KGQ initAlex Deucher
wptr is a 64 bit value and we need to update the full value, not just 32 bits. Align with what we already do for KCQs. Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit a2918f958d3f677ea93c0ac257cb6ba69b7abb7c) Cc: stable@vger.kernel.org
2026-01-29drm/amdgpu/gfx11: fix wptr reset in KGQ initAlex Deucher
wptr is a 64 bit value and we need to update the full value, not just 32 bits. Align with what we already do for KCQs. Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 1f16866bdb1daed7a80ca79ae2837a9832a74fbc) Cc: stable@vger.kernel.org
2026-01-29drm/amdgpu/gfx10: fix wptr reset in KGQ initAlex Deucher
wptr is a 64 bit value and we need to update the full value, not just 32 bits. Align with what we already do for KCQs. Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit e80b1d1aa1073230b6c25a1a72e88f37e425ccda) Cc: stable@vger.kernel.org
2026-01-29drm/amdgpu/gfx12: adjust KGQ reset sequenceAlex Deucher
Kernel gfx queues do not need to be reinitialized or remapped after a reset. Align with gfx11. v2: preserve init and remap for MMIO case. Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-29drm/amdgpu/gfx11: adjust KGQ reset sequenceAlex Deucher
Kernel gfx queues do not need to be reinitialized or remapped after a reset. This fixes queue reset failures on APUs. v2: preserve init and remap for MMIO case. Fixes: b3e9bfd86658 ("drm/amdgpu/gfx11: add ring reset callbacks") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4789 Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-29drm/amdgpu/gfx12: fix wptr reset in KGQ initAlex Deucher
wptr is a 64 bit value and we need to update the full value, not just 32 bits. Align with what we already do for KCQs. Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-29drm/amdgpu/gfx11: fix wptr reset in KGQ initAlex Deucher
wptr is a 64 bit value and we need to update the full value, not just 32 bits. Align with what we already do for KCQs. Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-29drm/amdgpu/gfx10: fix wptr reset in KGQ initAlex Deucher
wptr is a 64 bit value and we need to update the full value, not just 32 bits. Align with what we already do for KCQs. Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-29drm/amdgpu: Use AMDGPU_MQD_SIZE_ALIGN in KGDLang Yu
Use AMDGPU_MQD_SIZE_ALIGN for both kernel and user queue. Signed-off-by: Lang Yu <lang.yu@amd.com> Reviewed-by: David Belanger <david.belanger@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Mukul Joshi <mukul.joshi@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-29drm/amdgpu: Add a helper macro to align mqd sizeLang Yu
MES FW uses address(mqd_addr + sizeof(struct mqd) + 3*sizeof(uint32_t)) as fence address and writes a 32 bit fence value to this address. Driver needs to allocate some extra memory(at least 4 DWs) in addition to sizeof(struct mqd) as mqd memory(limited to gfx/compute/sdma queue). For gfx11/12, sizeof(struct mqd) < PAGE_SIZE, KGD allocates mqd memory with PAGE_SIZE aligned works. For gfx12.1, sizeof(struct mqd) == PAGE_SIZE, it doesn't work. KFD mqd manager hardcodes mqd size to PAGE_SIZE/MQD_SIZE across different IP versions to solve this issue. To avoid hardcoding in differnet places and across different IP versions. Let's use AMDGPU_MQD_SIZE_ALIGN instead. It is used in two places. 1. mqd memory alloction 2. mqd stride handling for multi xcc config v2: Use AMDGPU_GPU_PAGE_ALIGN. (Mukul) Signed-off-by: Lang Yu <lang.yu@amd.com> Reviewed-by: David Belanger <david.belanger@amd.com> (v1) Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Mukul Joshi <mukul.joshi@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-29drm/amdgpu: validate user queue size constraintsJesse.Zhang
Add validation to ensure user queue sizes meet hardware requirements: - Size must be a power of two for efficient ring buffer wrapping - Size must be at least AMDGPU_GPU_PAGE_SIZE to prevent undersized allocations This prevents invalid configurations that could lead to GPU faults or unexpected behavior. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-28drm/amdgpu: Fix cond_exec handling in amdgpu_ib_schedule()Alex Deucher
The EXEC_COUNT field must be > 0. In the gfx shadow handling we always emit a cond_exec packet after the gfx_shadow packet, but the EXEC_COUNT never gets patched. This leads to a hang when we try and reset queues on gfx11 APUs. Fixes: c68cbbfd54c6 ("drm/amdgpu: cleanup conditional execution") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4789 Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit ba205ac3d6e83f56c4f824f23f1b4522cb844ff3) Cc: stable@vger.kernel.org
2026-01-28drm/amdgpu/soc21: fix xclk for APUsAlex Deucher
The reference clock is supposed to be 100Mhz, but it appears to actually be slightly lower (99.81Mhz). Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/14451 Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 637fee3954d4bd509ea9d95ad1780fc174489860) Cc: stable@vger.kernel.org
2026-01-28drm/amdgpu: Fix cond_exec handling in amdgpu_ib_schedule()Alex Deucher
The EXEC_COUNT field must be > 0. In the gfx shadow handling we always emit a cond_exec packet after the gfx_shadow packet, but the EXEC_COUNT never gets patched. This leads to a hang when we try and reset queues on gfx11 APUs. Fixes: c68cbbfd54c6 ("drm/amdgpu: cleanup conditional execution") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4789 Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-28drm/amdgpu/soc21: fix xclk for APUsAlex Deucher
The reference clock is supposed to be 100Mhz, but it appears to actually be slightly lower (99.81Mhz). Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/14451 Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-28BackMerge tag 'v6.19-rc7' into drm-nextDave Airlie
Linux 6.19-rc7 This is needed for msm and rust trees. Signed-off-by: Dave Airlie <airlied@redhat.com>
2026-01-27drm/amdgpu: fix NULL pointer dereference in amdgpu_gmc_filter_faults_removeJon Doron
On APUs such as Raven and Renoir (GC 9.1.0, 9.2.2, 9.3.0), the ih1 and ih2 interrupt ring buffers are not initialized. This is by design, as these secondary IH rings are only available on discrete GPUs. See vega10_ih_sw_init() which explicitly skips ih1/ih2 initialization when AMD_IS_APU is set. However, amdgpu_gmc_filter_faults_remove() unconditionally uses ih1 to get the timestamp of the last interrupt entry. When retry faults are enabled on APUs (noretry=0), this function is called from the SVM page fault recovery path, resulting in a NULL pointer dereference when amdgpu_ih_decode_iv_ts_helper() attempts to access ih->ring[]. The crash manifests as: BUG: kernel NULL pointer dereference, address: 0000000000000004 RIP: 0010:amdgpu_ih_decode_iv_ts_helper+0x22/0x40 [amdgpu] Call Trace: amdgpu_gmc_filter_faults_remove+0x60/0x130 [amdgpu] svm_range_restore_pages+0xae5/0x11c0 [amdgpu] amdgpu_vm_handle_fault+0xc8/0x340 [amdgpu] gmc_v9_0_process_interrupt+0x191/0x220 [amdgpu] amdgpu_irq_dispatch+0xed/0x2c0 [amdgpu] amdgpu_ih_process+0x84/0x100 [amdgpu] This issue was exposed by commit 1446226d32a4 ("drm/amdgpu: Remove GC HW IP 9.3.0 from noretry=1") which changed the default for Renoir APU from noretry=1 to noretry=0, enabling retry fault handling and thus exercising the buggy code path. Fix this by adding a check for ih1.ring_size before attempting to use it. Also restore the soft_ih support from commit dd299441654f ("drm/amdgpu: Rework retry fault removal"). This is needed if the hardware doesn't support secondary HW IH rings. v2: additional updates (Alex) Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3814 Fixes: dd299441654f ("drm/amdgpu: Rework retry fault removal") Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Jon Doron <jond@wiz.io> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 6ce8d536c80aa1f059e82184f0d1994436b1d526) Cc: stable@vger.kernel.org
2026-01-27drm/amdgpu: Send RMA CPER at bad page loadingKent Russell
Some older builds weren't sending RMA CPERs when the bad page threshold was exceeded. Newer builds have resolved this, but there could be systems out there with bad page numbers higher than the threshold, that haven't sent out an RMA CPER. To be thorough and safe, send an RMA CPER when we load the table, if the threshold is met or exceeded, instead of waiting for the next UE to trigger the CPER. Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-27drm/amdgpu: Fix jpeg ring test order in vcn_v4_0_3Jesse.Zhang
Fix the vcn reset sequence in vcn_v4_0_3_ring_reset() to restore JPEG power state and unlock the JPEG powergating mutex before running the JPEG ring post-reset helper. Fixes: d25c67fd9d6f ("drm/amdgpu/vcn4.0.3: rework reset handling") Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-27drm/amdgpu: fix NULL pointer dereference in amdgpu_gmc_filter_faults_removeJon Doron
On APUs such as Raven and Renoir (GC 9.1.0, 9.2.2, 9.3.0), the ih1 and ih2 interrupt ring buffers are not initialized. This is by design, as these secondary IH rings are only available on discrete GPUs. See vega10_ih_sw_init() which explicitly skips ih1/ih2 initialization when AMD_IS_APU is set. However, amdgpu_gmc_filter_faults_remove() unconditionally uses ih1 to get the timestamp of the last interrupt entry. When retry faults are enabled on APUs (noretry=0), this function is called from the SVM page fault recovery path, resulting in a NULL pointer dereference when amdgpu_ih_decode_iv_ts_helper() attempts to access ih->ring[]. The crash manifests as: BUG: kernel NULL pointer dereference, address: 0000000000000004 RIP: 0010:amdgpu_ih_decode_iv_ts_helper+0x22/0x40 [amdgpu] Call Trace: amdgpu_gmc_filter_faults_remove+0x60/0x130 [amdgpu] svm_range_restore_pages+0xae5/0x11c0 [amdgpu] amdgpu_vm_handle_fault+0xc8/0x340 [amdgpu] gmc_v9_0_process_interrupt+0x191/0x220 [amdgpu] amdgpu_irq_dispatch+0xed/0x2c0 [amdgpu] amdgpu_ih_process+0x84/0x100 [amdgpu] This issue was exposed by commit 1446226d32a4 ("drm/amdgpu: Remove GC HW IP 9.3.0 from noretry=1") which changed the default for Renoir APU from noretry=1 to noretry=0, enabling retry fault handling and thus exercising the buggy code path. Fix this by adding a check for ih1.ring_size before attempting to use it. Also restore the soft_ih support from commit dd299441654f ("drm/amdgpu: Rework retry fault removal"). This is needed if the hardware doesn't support secondary HW IH rings. v2: additional updates (Alex) Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3814 Fixes: dd299441654f ("drm/amdgpu: Rework retry fault removal") Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Jon Doron <jond@wiz.io> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-27drm/amdgpu: Simplify sorting of the bo listTvrtko Ursulin
Sort function only cares about the sign so we can replace the conditionals with a single subtraction. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-27drm/amdgpu/mes: Remove idr leftovers v2Tvrtko Ursulin
Commit cb17fff3a254 ("drm/amdgpu/mes: remove unused functions") removed most of the code using these IDRs but forgot to remove the struct members and init/destroy paths. There is also interrupt handling code in SDMA 5.0 and 5.2 which appears to be using it, but is is unreachable since nothing ever allocates the relevant IDR. We replace those with one time warnings just to avoid any functional difference, but it is also possible they should be removed. v2: also fix up gfx_v12_1.c and sdma_v7_1.c Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> References: cb17fff3a254 ("drm/amdgpu/mes: remove unused functions") Cc: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-21drm/amdgpu: fix type for wptr in ring backupAlex Deucher
Needs to be a u64. Fixes: 77cc0da39c7c ("drm/amdgpu: track ring state associated with a fence") Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 56fff1941abd3ca3b6f394979614ca7972552f7f)
2026-01-21drm/amdgpu: Fix validating flush_gpu_tlb_pasid()Timur Kristóf
When a function holds a lock and we return without unlocking it, it deadlocks the kernel. We should always unlock before returning. This commit fixes suspend/resume on SI. Tested on two Tahiti GPUs: FirePro W9000 and R9 280X. Fixes: f4db9913e4d3 ("drm/amdgpu: validate the flush_gpu_tlb_pasid()") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202601190121.z9C0uml5-lkp@intel.com/ Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit e3a6eff92bbd960b471966d9afccb4d584546d17)
2026-01-21drm/amdgpu: rename amdgpu_fence_driver_guilty_force_completion()Alex Deucher
The function no longer signals the fence so rename it to better match what it does. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-21drm/amdgpu: fix type for wptr in ring backupAlex Deucher
Needs to be a u64. Fixes: 77cc0da39c7c ("drm/amdgpu: track ring state associated with a fence") Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-21drm/amdgpu: mark invalid records with U64_MAXGangliang Xie
set retired_page of invalid ras records to U64_MAX, and skip them when reading ras records Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-21drm/amdgpu: Avoid excessive dmesg logLijo Lazar
KIQ access is not guaranteed to work reliably under all reset situations. Avoid flooding dmesg with HDP flush failure messages. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-21drm/amdgpu: Fix validating flush_gpu_tlb_pasid()Timur Kristóf
When a function holds a lock and we return without unlocking it, it deadlocks the kernel. We should always unlock before returning. This commit fixes suspend/resume on SI. Tested on two Tahiti GPUs: FirePro W9000 and R9 280X. Fixes: f4db9913e4d3 ("drm/amdgpu: validate the flush_gpu_tlb_pasid()") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202601190121.z9C0uml5-lkp@intel.com/ Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-21drm/amdgpu/vcn5.0.1: rework reset handlingJesse.Zhang
Resetting VCN resets the entire tile, including jpeg. When resetting the VCN, we need to ensure that JPEG data blocks are accessible and we also need to handle the JPEG queue. Add a helper function to restore the JPEG queue during the VCN reset. v2: split the jpeg helper in two, in the top helper we can stop the sched workqueues and attempt to wait for any outstanding fences. Then in the bottom helper, we can force completion, re-init the rings, and restart the sched workqueues (Alex) v3: merge patches 4 and 5 into one patch (Alex) Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-21drm/amdgpu/vcn4.0.3: rework reset handlingJesse.Zhang
Resetting VCN resets the entire tile, including jpeg. When resetting the VCN, we need to ensure that JPEG data blocks are accessible and we also need to handle the JPEG queue. Add a helper function to restore the JPEG queue during the VCN reset. v2: split the jpeg helper in two, in the top helper we can stop the sched workqueues and attempt to wait for any outstanding fences. Then in the bottom helper, we can force completion, re-init the rings, and restart the sched workqueues (Alex) v3: merge patches 1 and 2 into one patch (Alex) Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-21drm/amdgpu/vcn4.0.3: implement DPG pause mode handling for VCN 4.0.3Jesse.Zhang
For MI projects, when Dynamic Power Gating (DPG) is enabled, VCN reset operations should be performed with DPG in pause mode. Otherwise, the hardware may perform undesirable reset operations Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-20drm/amdgpu: fix error handling in ib_schedule()Alex Deucher
If fence emit fails, free the fence if necessary. Fixes: db36632ea51e ("drm/amdgpu: clean up and unify hw fence handling") Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 5eb680a06007f2f6ea333d11a4e29039da90614b)
2026-01-20drm/amdgpu: free hw_vm_fence when fail in amdgpu_job_allocJiqian Chen
If drm_sched_job_init fails, hw_vm_fence is not freed currently, then cause memory leak. Fixes: db36632ea51e ("drm/amdgpu: clean up and unify hw fence handling") Link: https://lore.kernel.org/amd-gfx/a5a828cb-0e4a-41f0-94c3-df31e5ddad52@amd.com/T/#t Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com> Reviewed-by: Amos Kong <kongjianjun@gmail.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 5d42ee457ccd1fb5da4c7f817825b2806ec36956)
2026-01-20drm/amdgpu: remove frame cntl for gfx v12Likun Gao
Remove emit_frame_cntl function for gfx v12, which is not support. Signed-off-by: Likun Gao <Likun.Gao@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 5aaa5058dec5bfdcb24c42fe17ad91565a3037ca) Cc: stable@vger.kernel.org
2026-01-20drm/amdgpu: add new job idsAlex Deucher
Use this for gfx, sdma, vpe IB tests and kernel shaders. The end goal it to get rid of the direct IB submit without a job structure. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-20drm/amdgpu: fix error handling in ib_schedule()Alex Deucher
If fence emit fails, free the fence if necessary. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-20drm/amdgpu/jpeg4.0.3: remove redundant sr-iov checkAlex Deucher
The per queue reset flag is only set when sr-iov is disabled so this check is not necessary as the function will never be called on sr-iov. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-20drm/amdgpu: Improve IP discovery checksum failure loggingPerry Yuan
Enhance the error logging in amdgpu_discovery_verify_checksum() to print the calculated checksum, the expected checksum, the data size. This extra context helps quickly identify if the issue is a data corruption, a partially read binary, or an invalid table header without requiring additional instrumentation. Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-20drm/amdgpu: free hw_vm_fence when fail in amdgpu_job_allocJiqian Chen
If drm_sched_job_init fails, hw_vm_fence is not freed currently, then cause memory leak. Fixes: db36632ea51e ("drm/amdgpu: clean up and unify hw fence handling") Link: https://lore.kernel.org/amd-gfx/a5a828cb-0e4a-41f0-94c3-df31e5ddad52@amd.com/T/#t Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com> Reviewed-by: Amos Kong <kongjianjun@gmail.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>