user/sven/linux.git/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c, branch v6.11.8

drm/amdgpu: timely save bad pages to eeprom after gpu ras reset is completed

2024-07-10T14:13:41Z

The problem case is as follows:
1. GPU A triggers a gpu ras reset, and GPU A drives GPU B to also perform a gpu ras reset.
2. After GPU B's ras reset has started, GPU B queries DE data. Since the DE data is queried in the ras reset thread instead of the page retirement thread, bad page retirement work is not triggered. Then, even if all gpu resets are completed, the bad pages stay cached in RAM until GPU B's bad page retirement work is triggered again, and only then are they saved to eeprom.

This patch saves the bad pages to eeprom promptly after the gpu ras reset is completed.

v2:
1. Add the above description to code comments.
2. Reuse existing function.

Signed-off-by: YiPeng Chai
Reviewed-by: Tao Zhou
Signed-off-by: Alex Deucher
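The scenario in this commit message can be pictured with a small user-space sketch. It is not the driver's code: `struct bad_page_cache`, `struct fake_eeprom`, `cache_bad_page()`, `flush_bad_pages()` and `on_reset_complete()` are hypothetical names invented for illustration, and the eeprom is modelled as a plain array. The sketch only assumes what the message states: pages recorded during the reset sit in RAM, and a flush at reset completion makes them persistent without waiting for a later retirement run.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_BAD_PAGES 16u

/* Hypothetical stand-in for the persistent bad page table. */
struct fake_eeprom {
    uint64_t records[MAX_BAD_PAGES];
    size_t count;
};

/* Hypothetical RAM cache: entries [saved, cached) are not yet persistent. */
struct bad_page_cache {
    uint64_t pages[MAX_BAD_PAGES];
    size_t cached;
    size_t saved;
};

/* Record a bad page in RAM only, as happens in the problem case. */
static int cache_bad_page(struct bad_page_cache *c, uint64_t addr)
{
    if (c->cached == MAX_BAD_PAGES)
        return -1;
    c->pages[c->cached++] = addr;
    return 0;
}

/* Write every cached-but-unsaved page to the store; return how many were written. */
static size_t flush_bad_pages(struct bad_page_cache *c, struct fake_eeprom *e)
{
    size_t written = 0;

    while (c->saved < c->cached && e->count < MAX_BAD_PAGES) {
        e->records[e->count++] = c->pages[c->saved++];
        written++;
    }
    return written;
}

/* Called once the reset has finished, so nothing waits for a later retirement run. */
static size_t on_reset_complete(struct bad_page_cache *c, struct fake_eeprom *e)
{
    return flush_bad_pages(c, e);
}
```

Because the flush is a single helper, the same function could also be called from a teardown path, which matches the "reuse existing function" note in v2.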

drm/amdgpu: flush all cached ras bad pages to eeprom

2024-07-10T14:13:35Z

Before uninstalling gpu driver, flush all cached ras bad pages to eeprom.

v2: Put the same code into a function and reuse the function.

Signed-off-by: YiPeng Chai
Reviewed-by: Tao Zhou
Signed-off-by: Alex Deucher

drm/amdgpu: add ras event state device attribute support

2024-07-08T20:56:13Z

Add amdgpu ras 'event_state' sysfs device attribute support.

Signed-off-by: Yang Wang
Reviewed-by: Tao Zhou
Reviewed-by: Hawking Zhang
Signed-off-by: Alex Deucher

drm/amdgpu: add ras POSION_CONSUMPTION event id support

2024-07-08T20:55:37Z

Add amdgpu ras POISON_CONSUMPTION event id support.

Signed-off-by: Yang Wang
Reviewed-by: Tao Zhou
Reviewed-by: Hawking Zhang
Signed-off-by: Alex Deucher

drm/amdgpu: add ras POSION_CREATION event id support

2024-07-08T20:55:18Z

Add amdgpu ras POISON_CREATION event id support.

Signed-off-by: Yang Wang
Reviewed-by: Tao Zhou
Reviewed-by: Hawking Zhang
Signed-off-by: Alex Deucher

drm/amdgpu: refine amdgpu ras event id core code

2024-07-08T20:55:11Z

v1:
- use unified event id to manage ras events
- add a new function amdgpu_ras_query_error_status_with_event() to accept event type as parameter.

v2: add a warn log to show the location of function failure when calling amdgpu_ras_mark_event(). (Tao Zhou)

v3: change RAS_EVENT_TYPE_ISR to RAS_EVENT_TYPE_FATAL.

v4: rename amdgpu_ras_get_recovery_event() to amdgpu_ras_get_fatal_error_event().

Signed-off-by: Yang Wang
Reviewed-by: Tao Zhou
Reviewed-by: Hawking Zhang
Signed-off-by: Alex Deucher

drm/amdgpu: sysfs node disable query error count during gpu reset

2024-07-08T20:46:14Z

Disable error count queries through the sysfs node during gpu reset.

Signed-off-by: YiPeng Chai
Reviewed-by: Stanley.Yang
Signed-off-by: Alex Deucher

drm/amdgpu: Fix hbm stack id in boot error report

2024-07-01T20:10:47Z

To align with firmware, hbm id field 0x1 refers to hbm stack 0, and 0x2 refers to hbm stack 1.

Signed-off-by: Hawking Zhang
Reviewed-by: Tao Zhou
Signed-off-by: Alex Deucher
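The mapping stated in this commit message can be written out as a tiny lookup. This is a sketch only: `hbm_stack_from_id_field()` is a hypothetical helper, not a function from the driver, and returning -1 for any other field value is an assumption made here because the message describes only 0x1 and 0x2.

```c
#include <stdint.h>

/*
 * Hypothetical helper: translate the firmware's hbm id field into a
 * stack index. Only 0x1 and 0x2 are described by the commit message;
 * every other value is treated as unknown (-1) in this sketch.
 */
static int hbm_stack_from_id_field(uint32_t hbm_id)
{
    switch (hbm_id) {
    case 0x1:
        return 0; /* field 0x1 -> hbm stack 0 */
    case 0x2:
        return 1; /* field 0x2 -> hbm stack 1 */
    default:
        return -1;
    }
}
```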

drm/amdgpu: add gpu reset check and exception handling

2024-06-27T21:32:12Z

Add gpu reset check and exception handling for page retirement.

v2: Clear poison consumption messages cached in fifo after non mode-1 reset.

Signed-off-by: YiPeng Chai
Reviewed-by: Hawking Zhang
Signed-off-by: Alex Deucher

drm/amdgpu: refine poison consumption interrupt handler

2024-06-27T21:32:06Z

1. The poison fifo is only used for poison consumption requests.
2. Merge reset requests when poison fifo caches multiple poison consumption messages.

Signed-off-by: YiPeng Chai
Reviewed-by: Hawking Zhang
Signed-off-by: Alex Deucher
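One way to picture "merging reset requests" is to drain every queued message and combine their reset flags, so a single reset covers all of them. The sketch below assumes that each message carries a bitmask of requested reset types. That assumption, the `struct poison_msg` and `struct poison_fifo` layouts, and the function names are all hypothetical and are not taken from the driver.

```c
#include <stddef.h>
#include <stdint.h>

#define FIFO_CAP 8u

/* Hypothetical poison consumption message. */
struct poison_msg {
    uint32_t block;       /* reporting hardware block (illustrative) */
    uint32_t reset_flags; /* bitmask of requested reset types (assumed) */
};

/* Hypothetical fixed-size ring buffer of pending messages. */
struct poison_fifo {
    struct poison_msg slots[FIFO_CAP];
    size_t head;
    size_t count;
};

/* Queue one message; returns -1 when the fifo is full. */
static int fifo_push(struct poison_fifo *f, struct poison_msg m)
{
    if (f->count == FIFO_CAP)
        return -1;
    f->slots[(f->head + f->count) % FIFO_CAP] = m;
    f->count++;
    return 0;
}

/*
 * Drain every queued message and OR the reset flags together, so the
 * caller can issue one reset instead of one per message. The number of
 * drained messages is stored in *handled when it is non-NULL.
 */
static uint32_t fifo_drain_merged_reset(struct poison_fifo *f, size_t *handled)
{
    uint32_t merged = 0;
    size_t n = 0;

    while (f->count > 0) {
        merged |= f->slots[f->head].reset_flags;
        f->head = (f->head + 1) % FIFO_CAP;
        f->count--;
        n++;
    }
    if (handled)
        *handled = n;
    return merged;
}
```

With two queued messages requesting flags 0x1 and 0x4, the drain returns 0x5 and leaves the fifo empty, so the caller performs one reset for both.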