user/sven/linux.git/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c, branch v6.11.8

drm/amdkfd: amdkfd_free_gtt_mem clear the correct pointer

2024-10-10T10:03:24Z

[ Upstream commit c86ad39140bbcb9dc75a10046c2221f657e8083b ] Pass pointer reference to amdgpu_bo_unref to clear the correct pointer, otherwise amdgpu_bo_unref clear the local variable, the original pointer not set to NULL, this could cause use-after-free bug. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling Acked-by: Christian König Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin

drm/amdkfd: add reset cause in gpu pre-reset smi event

2024-06-05T15:25:14Z

reset cause is requested by customer as additional info for gpu reset smi event. v2: integerate reset sources suggested by Lijo Lazar Signed-off-by: Eric Huang Reviewed-by: Lijo Lazar Signed-off-by: Alex Deucher

drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs

2024-05-08T19:17:04Z

Small APUs(i.e., consumer, embedded products) usually have a small carveout device memory which can't satisfy most compute workloads memory allocation requirements. We can't even run a Basic MNIST Example with a default 512MB carveout. https://github.com/pytorch/examples/tree/main/mnist. Error Log: "torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 84.00 MiB. GPU 0 has a total capacity of 512.00 MiB of which 0 bytes is free. Of the allocated memory 103.83 MiB is allocated by PyTorch, and 22.17 MiB is reserved by PyTorch but unallocated" Though we can change BIOS settings to enlarge carveout size, which is inflexible and may bring complaint. On the other hand, the memory resource can't be effectively used between host and device. The solution is MI300A approach, i.e., let VRAM allocations go to GTT. Then device and host can flexibly and effectively share memory resource. v2: Report local_mem_size_private as 0. (Felix) Signed-off-by: Lang Yu Reviewed-by: Felix Kuehling Signed-off-by: Alex Deucher

drm/amdgpu: prepare to handle pasid poison consumption

2024-04-26T21:22:42Z

Prepare to handle pasid poison consumption. Signed-off-by: YiPeng Chai Reviewed-by: Tao Zhou Signed-off-by: Alex Deucher

drm/amdgpu: make reset method configurable for RAS poison

2024-03-20T17:38:15Z

Each RAS block has different requirement for gpu reset in poison consumption handling. Add support for mmhub RAS poison consumption handling. v2: remove the mmhub poison support for kfd int v10. Signed-off-by: Tao Zhou Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher

drm/amdgpu: support utcl2 RAS poison query for mmhub

2024-03-20T17:38:14Z

Support the query for both gfxhub and mmhub, also replace xcc_id with hub_inst. Signed-off-by: Tao Zhou Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher

drm/amdgpu: retire gfx ras query_utcl2_poison_status

2024-03-20T17:37:36Z

Replace it with related interface in gfxhub functions. v2: replace node id with xcc id. get node id for query_utcl2_poison_status Signed-off-by: Tao Zhou Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher

drm/amdgpu: Init zone device and drm client after mode-1 reset on reload

2024-03-20T17:12:57Z

In passthrough environment, when amdgpu is reloaded after unload, mode-1 is triggered after initializing the necessary IPs, That init does not include KFD, and KFD init waits until the reset is completed. KFD init is called in the reset handler, but in this case, the zone device and drm client is not initialized, causing app to create kernel panic. v2: Removing the init KFD condition from amdgpu_amdkfd_drm_client_create. As the previous version has the potential of creating DRM client twice. v3: v2 patch results in SDMA engine hung as DRM open causes VM clear to SDMA before SDMA init. Adding the condition to in drm client creation, on top of v1, to guard against drm client creation call multiple times. Signed-off-by: Ahmad Rehman Reviewed-by: Felix Kuehling Signed-off-by: Alex Deucher

drm/amdkfd: Skip packet submission on fatal error

2024-02-26T16:14:31Z

If fatal error is detected, packet submission won't go through. Return error in such cases. Also, avoid waiting for fence when fatal error is detected. Signed-off-by: Lijo Lazar Reviewed-by: Asad Kamal Signed-off-by: Alex Deucher

drm/amdgpu: move the drm client creation behind drm device registration

2024-01-29T20:35:13Z

This patch is to eliminate interrupt warning below: "[drm] Fence fallback timer expired on ring sdma0.0". An early vm pt clearing job is sent to SDMA ahead of interrupt enabled. And re-locating the drm client creation following after drm_dev_register looks like a more proper flow. v2: wrap the drm client creation Fixes: 1819200166ce ("drm/amdkfd: Export DMABufs from KFD using GEM handles") Signed-off-by: Le Ma Reviewed-by: Felix Kuehling Reviewed-by: Lijo Lazar Signed-off-by: Alex Deucher