From f0e1a0643a59bf1f922fa209cec86a170b784f3f Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Tue, 18 Jun 2024 10:09:17 -1000 Subject: sched_ext: Implement BPF extensible scheduler class MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Implement a new scheduler class sched_ext (SCX), which allows scheduling policies to be implemented as BPF programs to achieve the following: 1. Ease of experimentation and exploration: Enabling rapid iteration of new scheduling policies. 2. Customization: Building application-specific schedulers which implement policies that are not applicable to general-purpose schedulers. 3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling policies in production environments. sched_ext leverages BPF’s struct_ops feature to define a structure which exports function callbacks and flags to BPF programs that wish to implement scheduling policies. The struct_ops structure exported by sched_ext is struct sched_ext_ops, and is conceptually similar to struct sched_class. The role of sched_ext is to map the complex sched_class callbacks to the simpler and more ergonomic struct sched_ext_ops callbacks. For a more detailed discussion of the motivations and an overview, please refer to the cover letter. Later patches will also add several example schedulers and documentation. This patch implements the minimum core framework to enable implementation of BPF schedulers. Subsequent patches will gradually add functionality including safety guarantee mechanisms, nohz and cgroup support. include/linux/sched/ext.h defines struct sched_ext_ops. With the comment on top, each operation should be self-explanatory. The following are worth noting: - Both "sched_ext" and its shorthand "scx" are used. If the identifier already has "sched" in it, "ext" is used; otherwise, "scx". - In sched_ext_ops, only .name is mandatory. Every operation is optional and if omitted a simple but functional default behavior is provided. - A new policy constant SCHED_EXT is added and a task can select sched_ext by invoking sched_setscheduler(2) with the new policy constant (a user-space sketch follows below). However, if the BPF scheduler is not loaded, SCHED_EXT is the same as SCHED_NORMAL and the task is scheduled by CFS. When the BPF scheduler is loaded, all tasks which have the SCHED_EXT policy are switched to sched_ext. - To bridge the workflow imbalance between the scheduler core and sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch queues (dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and one local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for convenience and need not be used by a scheduler that doesn't require it. SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when putting the next task on the CPU. The BPF scheduler can manage an arbitrary number of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq(). - sched_ext guarantees system integrity no matter what the BPF scheduler does. To enable this, each task's ownership is tracked through p->scx.ops_state and all tasks are put on the scx_tasks list. The disable path can always recover and revert all tasks back to CFS. See p->scx.ops_state and scx_tasks. - A task is not tied to its rq while enqueued. This decouples CPU selection from queueing and allows sharing a scheduling queue across an arbitrary subset of CPUs. This adds some complexity as a task may need to be bounced between rq's right before it starts executing. See dispatch_to_local_dsq() and move_task_to_local_dsq().
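For illustration (not part of the patch), the SCHED_EXT selection mentioned above is plain sched_setscheduler(2) from user space. A minimal, hedged sketch; the SCHED_EXT value is the one added to include/uapi/linux/sched.h by this patch, redefined locally in case installed headers predate it:

    #include <sched.h>
    #include <stdio.h>

    #ifndef SCHED_EXT
    #define SCHED_EXT 7	/* added by this patch */
    #endif

    int main(void)
    {
    	struct sched_param sp = { .sched_priority = 0 };

    	/* Behaves like SCHED_NORMAL until a BPF scheduler is loaded. */
    	if (sched_setscheduler(0, SCHED_EXT, &sp))
    		perror("sched_setscheduler");
    	return 0;
    }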
- One complication that arises from the above weak association between task and rq is that synchronizing with dequeue() gets complicated as dequeue() may happen anytime while the task is enqueued and the dispatch path might need to release the rq lock to transfer the task. Solving this requires a bit of complexity. See the logic around p->scx.sticky_cpu and p->scx.ops_qseq. - Both enable and disable paths are a bit complicated. The enable path switches all tasks without blocking to avoid issues which can arise from partially switched states (e.g. the switching task itself being starved). The disable path can't trust the BPF scheduler at all, so it also has to guarantee forward progress without blocking. See scx_ops_enable() and scx_ops_disable_workfn(). - When sched_ext is disabled, static_branches are used to shut down the entry points from hot paths. v7: - scx_ops_bypass() was incorrectly and unnecessarily trying to grab scx_ops_enable_mutex which can lead to deadlocks in the disable path. Fixed. - Fixed TASK_DEAD handling bug in scx_ops_enable() path which could lead to use-after-free. - Consolidated per-cpu variable usages and other cleanups. v6: - SCX_NR_ONLINE_OPS replaced with SCX_OPI_*_BEGIN/END so that multiple groups can be expressed. Later CPU hotplug operations are put into their own group. - SCX_OPS_DISABLING state is replaced with the new bypass mechanism which allows temporarily putting the system into simple FIFO scheduling mode bypassing the BPF scheduler. In addition to the shutdown path, this will also be used to isolate the BPF scheduler across PM events. Enabling and disabling the bypass mode requires iterating all runnable tasks. rq->scx.runnable_list addition is moved from the later watchdog patch. - ops.prep_enable() is replaced with ops.init_task() and ops.enable/disable() are now called whenever the task enters and leaves sched_ext instead of when the task becomes schedulable on sched_ext and stops being so. A new operation - ops.exit_task() - is called when the task stops being schedulable on sched_ext. - scx_bpf_dispatch() can now be called from ops.select_cpu() too. This removes the need for communicating the local dispatch decision made by ops.select_cpu() to ops.enqueue() via per-task storage. SCX_KF_SELECT_CPU is added to support the change. - SCX_TASK_ENQ_LOCAL which told the BPF scheduler that scx_select_cpu_dfl() wants the task to be dispatched to the local DSQ was removed. Instead, scx_bpf_select_cpu_dfl() now dispatches directly if it finds a suitable idle CPU. If such behavior is not desired, users can use scx_bpf_select_cpu_dfl() which returns the verdict in a bool out param. - scx_select_cpu_dfl() was mishandling WAKE_SYNC and could end up queueing many tasks on a local DSQ, which made tasks execute in order while other CPUs stayed idle and made some hackbench numbers really bad. Fixed. - The current state of sched_ext can now be monitored through files under /sys/sched_ext instead of /sys/kernel/debug/sched/ext. This is to enable monitoring on kernels which don't enable debugfs. - sched_ext wasn't telling BPF that ops.dispatch()'s @prev argument may be NULL and a BPF scheduler which derefs the pointer without checking could crash the kernel. Tell BPF. This is currently a bit ugly. A better way to annotate this is expected in the future. - scx_exit_info updated to carry pointers to message buffers instead of embedding them directly. This decouples buffer sizes from API so that they can be changed without breaking compatibility.
- exit_code added to scx_exit_info. This is used to indicate different exit conditions on non-error exits and will be used to handle e.g. CPU hotplugs. - The patch "sched_ext: Allow BPF schedulers to switch all eligible tasks into sched_ext" is folded in and the interface is changed so that partial switching is indicated with a new ops flag %SCX_OPS_SWITCH_PARTIAL. This makes scx_bpf_switch_all() unnecessary and in turn SCX_KF_INIT. ops.init() is now called with SCX_KF_SLEEPABLE. - Code reorganized so that only the parts necessary to integrate with the rest of the kernel are in the header files. - Changes to reflect the BPF and other kernel changes including the addition of bpf_sched_ext_ops.cfi_stubs. v5: - To accommodate 32bit configs, p->scx.ops_state is now atomic_long_t instead of atomic64_t and scx_dsp_buf_ent.qseq which uses load_acquire/store_release is now unsigned long instead of u64. - Fix the bug where bpf_scx_btf_struct_access() was allowing write access to arbitrary fields. - Distinguish kfuncs which can be called from any sched_ext ops and from anywhere. e.g. scx_bpf_pick_idle_cpu() can now be called only from sched_ext ops. - Rename "type" to "kind" in scx_exit_info to make it easier to use in languages in which "type" is a reserved keyword. - Since cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup"), PF_IDLE is not set on idle tasks which haven't been online yet, which made scx_task_iter_next_filtered() include those idle tasks in iterations leading to oopses. Update scx_task_iter_next_filtered() to directly test p->sched_class against idle_sched_class instead of using is_idle_task() which tests PF_IDLE. - Other updates to match upstream changes such as adding const to set_cpumask() param and renaming check_preempt_curr() to wakeup_preempt(). v4: - SCHED_CHANGE_BLOCK replaced with the previous sched_deq_and_put_task()/sched_enq_and_set_task() pair. This is because upstream is adopting a different generic cleanup mechanism. Once that lands, the code will be adapted accordingly. - task_on_scx() used to test whether a task should be switched into SCX, which is confusing. Renamed to task_should_scx(). task_on_scx() now tests whether a task is currently on SCX. - scx_has_idle_cpus is barely used anymore and replaced with a direct check on the idle cpumask. - SCX_PICK_IDLE_CORE added and scx_pick_idle_cpu() improved to prefer fully idle cores. - ops.enable() now sees up-to-date p->scx.weight value. - ttwu_queue path is disabled for tasks on SCX to avoid confusing BPF schedulers expecting ->select_cpu() call. - Use cpu_smt_mask() instead of topology_sibling_cpumask() like the rest of the scheduler. v3: - ops.set_weight() added to allow BPF schedulers to track weight changes without polling p->scx.weight. - move_task_to_local_dsq() was losing SCX-specific enq_flags when enqueueing the task on the target dsq because it goes through activate_task() which loses the upper 32bit of the flags. Carry the flags through rq->scx.extra_enq_flags. - scx_bpf_dispatch(), scx_bpf_pick_idle_cpu(), scx_bpf_task_running() and scx_bpf_task_cpu() now use the new KF_RCU instead of KF_TRUSTED_ARGS to make it easier for BPF schedulers to call them. - The kfunc helper access control mechanism implemented through sched_ext_entity.kf_mask is improved. Now SCX_CALL_OP*() is always used when invoking scx_ops operations. v2: - balance_scx_on_up() is dropped. Instead, on UP, balance_scx() is called from put_prev_task_scx() and pick_next_task_scx() as necessary.
To determine whether balance_scx() should be called from put_prev_task_scx(), SCX_TASK_DEQD_FOR_SLEEP flag is added. See the comment in put_prev_task_scx() for details. - sched_deq_and_put_task() / sched_enq_and_set_task() sequences replaced with SCHED_CHANGE_BLOCK(). - Unused all_dsqs list removed. This was a left-over from previous iterations. - p->scx.kf_mask is added to track and enforce which kfunc helpers are allowed. Also, init/exit sequences are updated to make some kfuncs always safe to call regardless of the current BPF scheduler state. Combined, this should make all the kfuncs safe. - BPF now supports sleepable struct_ops operations. Hacky workaround removed and operations and kfunc helpers are tagged appropriately. - BPF now supports bitmask / cpumask helpers. scx_bpf_get_idle_cpumask() and friends are added so that BPF schedulers can use the idle masks with the generic helpers. This replaces the hacky kfunc helpers added by a separate patch in V1. - CONFIG_SCHED_CLASS_EXT can no longer be enabled if SCHED_CORE is enabled. This restriction will be removed by a later patch which adds core-sched support. - Add MAINTAINERS entries and other misc changes. Signed-off-by: Tejun Heo Co-authored-by: David Vernet Acked-by: Josh Don Acked-by: Hao Luo Acked-by: Barret Rhoden Cc: Andrea Righi --- include/uapi/linux/sched.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h index 3bac0a8ceab2..359a14cc76a4 100644 --- a/include/uapi/linux/sched.h +++ b/include/uapi/linux/sched.h @@ -118,6 +118,7 @@ struct clone_args { /* SCHED_ISO: reserved but not implemented yet */ #define SCHED_IDLE 5 #define SCHED_DEADLINE 6 +#define SCHED_EXT 7 /* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */ #define SCHED_RESET_ON_FORK 0x40000000 -- cgit v1.2.3 From 8169b2097d88d99d7e4a72e20e4b549efe9eb8d7 Mon Sep 17 00:00:00 2001 From: Ashutosh Dixit Date: Wed, 3 Jul 2024 09:48:01 -0700 Subject: drm/xe/uapi: Rename xe perf layer as xe observation layer MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In Xe, the perf layer allows capture of HW counter streams. These HW counters are generally performance related but don't necessarily have to be. Also, the name "perf" is a carryover from i915 and is not preferred. Here we propose the name "observation" for this common layer which allows capture of different types of these counter streams.
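For illustration (not part of the patch), the renamed layer keeps the same top-level dispatch shape: user space picks a stream type and an op, and passes type-specific params through the @param pointer. A hedged user-space sketch, assuming an OA property chain has been built elsewhere and that the open op returns the new stream fd on success:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <drm/xe_drm.h>

    /* Open an OA stream through the observation layer ioctl.
     * 'oa_props' must point to a chain of drm_xe_ext_set_property
     * structs describing the stream (built elsewhere).
     */
    int open_oa_stream(int drm_fd, uint64_t oa_props)
    {
    	struct drm_xe_observation_param param = {
    		.extensions = 0,	/* must be zero, see xe_observation_ioctl() */
    		.observation_type = DRM_XE_OBSERVATION_TYPE_OA,
    		.observation_op = DRM_XE_OBSERVATION_OP_STREAM_OPEN,
    		.param = oa_props,
    	};

    	/* Expected to return the observation stream fd on success. */
    	return ioctl(drm_fd, DRM_IOCTL_XE_OBSERVATION, &param);
    }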
v2: Rename observability layer to observation layer (Lucas/Rodrigo) v3: Rename sysctl file to "observation_paranoid" (Jose) Fixes: 52c2e956dceb ("drm/xe/perf/uapi: "Perf" layer to support multiple perf counter stream types") Fixes: fe8929bdf835 ("drm/xe/perf/uapi: Add perf_stream_paranoid sysctl") Acked-by: Lucas De Marchi Acked-by: Rodrigo Vivi Signed-off-by: Ashutosh Dixit Reviewed-by: Umesh Nerlige Ramappa Acked-by: José Roberto de Souza Link: https://patchwork.freedesktop.org/patch/msgid/20240703164801.2561423-1-ashutosh.dixit@intel.com --- drivers/gpu/drm/xe/Makefile | 2 +- drivers/gpu/drm/xe/xe_device.c | 4 +- drivers/gpu/drm/xe/xe_device_types.h | 2 +- drivers/gpu/drm/xe/xe_gt_types.h | 2 +- drivers/gpu/drm/xe/xe_module.c | 6 +-- drivers/gpu/drm/xe/xe_oa.c | 34 ++++++------ drivers/gpu/drm/xe/xe_observation.c | 93 ++++++++++++++++++++++++++++++++ drivers/gpu/drm/xe/xe_observation.h | 20 +++++++ drivers/gpu/drm/xe/xe_perf.c | 92 ------------------------------- drivers/gpu/drm/xe/xe_perf.h | 20 ------- include/uapi/drm/xe_drm.h | 102 ++++++++++++++++++----------------- 11 files changed, 190 insertions(+), 187 deletions(-) create mode 100644 drivers/gpu/drm/xe/xe_observation.c create mode 100644 drivers/gpu/drm/xe/xe_observation.h delete mode 100644 drivers/gpu/drm/xe/xe_perf.c delete mode 100644 drivers/gpu/drm/xe/xe_perf.h (limited to 'include/uapi') diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile index b1e03bfe4a68..628c245c4822 100644 --- a/drivers/gpu/drm/xe/Makefile +++ b/drivers/gpu/drm/xe/Makefile @@ -96,10 +96,10 @@ xe-y += xe_bb.o \ xe_mocs.o \ xe_module.o \ xe_oa.o \ + xe_observation.o \ xe_pat.o \ xe_pci.o \ xe_pcode.o \ - xe_perf.o \ xe_pm.o \ xe_preempt_fence.o \ xe_pt.o \ diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c index cfda7cb5df2c..03492fbcb8fb 100644 --- a/drivers/gpu/drm/xe/xe_device.c +++ b/drivers/gpu/drm/xe/xe_device.c @@ -42,9 +42,9 @@ #include "xe_memirq.h" #include "xe_mmio.h" #include "xe_module.h" +#include "xe_observation.h" #include "xe_pat.h" #include "xe_pcode.h" -#include "xe_perf.h" #include "xe_pm.h" #include "xe_query.h" #include "xe_sriov.h" @@ -142,7 +142,7 @@ static const struct drm_ioctl_desc xe_ioctls[] = { DRM_RENDER_ALLOW), DRM_IOCTL_DEF_DRV(XE_WAIT_USER_FENCE, xe_wait_user_fence_ioctl, DRM_RENDER_ALLOW), - DRM_IOCTL_DEF_DRV(XE_PERF, xe_perf_ioctl, DRM_RENDER_ALLOW), + DRM_IOCTL_DEF_DRV(XE_OBSERVATION, xe_observation_ioctl, DRM_RENDER_ALLOW), }; static long xe_drm_ioctl(struct file *file, unsigned int cmd, unsigned long arg) diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h index c37be471d11c..3bca6d344744 100644 --- a/drivers/gpu/drm/xe/xe_device_types.h +++ b/drivers/gpu/drm/xe/xe_device_types.h @@ -463,7 +463,7 @@ struct xe_device { /** @heci_gsc: graphics security controller */ struct xe_heci_gsc heci_gsc; - /** @oa: oa perf counter subsystem */ + /** @oa: oa observation subsystem */ struct xe_oa oa; /** @needs_flr_on_fini: requests function-reset on fini */ diff --git a/drivers/gpu/drm/xe/xe_gt_types.h b/drivers/gpu/drm/xe/xe_gt_types.h index 24bb95de920f..6b5e0b45efb0 100644 --- a/drivers/gpu/drm/xe/xe_gt_types.h +++ b/drivers/gpu/drm/xe/xe_gt_types.h @@ -389,7 +389,7 @@ struct xe_gt { u8 instances_per_class[XE_ENGINE_CLASS_MAX]; } user_engines; - /** @oa: oa perf counter subsystem per gt info */ + /** @oa: oa observation subsystem per gt info */ struct xe_oa_gt oa; }; diff --git a/drivers/gpu/drm/xe/xe_module.c b/drivers/gpu/drm/xe/xe_module.c 
index 893858a2eea0..499540add465 100644 --- a/drivers/gpu/drm/xe/xe_module.c +++ b/drivers/gpu/drm/xe/xe_module.c @@ -11,7 +11,7 @@ #include "xe_drv.h" #include "xe_hw_fence.h" #include "xe_pci.h" -#include "xe_perf.h" +#include "xe_observation.h" #include "xe_sched_job.h" struct xe_modparam xe_modparam = { @@ -80,8 +80,8 @@ static const struct init_funcs init_funcs[] = { .exit = xe_unregister_pci_driver, }, { - .init = xe_perf_sysctl_register, - .exit = xe_perf_sysctl_unregister, + .init = xe_observation_sysctl_register, + .exit = xe_observation_sysctl_unregister, }, }; diff --git a/drivers/gpu/drm/xe/xe_oa.c b/drivers/gpu/drm/xe/xe_oa.c index 4188516a7816..6d69f751bf78 100644 --- a/drivers/gpu/drm/xe/xe_oa.c +++ b/drivers/gpu/drm/xe/xe_oa.c @@ -32,7 +32,7 @@ #include "xe_macros.h" #include "xe_mmio.h" #include "xe_oa.h" -#include "xe_perf.h" +#include "xe_observation.h" #include "xe_pm.h" #include "xe_sched_job.h" #include "xe_sriov.h" @@ -481,7 +481,7 @@ static int __xe_oa_read(struct xe_oa_stream *stream, char __user *buf, OASTATUS_RELEVANT_BITS, 0); /* * Signal to userspace that there is non-zero OA status to read via - * @DRM_XE_PERF_IOCTL_STATUS perf fd ioctl + * @DRM_XE_OBSERVATION_IOCTL_STATUS observation stream fd ioctl */ if (stream->oa_status & OASTATUS_RELEVANT_BITS) return -EIO; @@ -1158,15 +1158,15 @@ static long xe_oa_ioctl_locked(struct xe_oa_stream *stream, unsigned long arg) { switch (cmd) { - case DRM_XE_PERF_IOCTL_ENABLE: + case DRM_XE_OBSERVATION_IOCTL_ENABLE: return xe_oa_enable_locked(stream); - case DRM_XE_PERF_IOCTL_DISABLE: + case DRM_XE_OBSERVATION_IOCTL_DISABLE: return xe_oa_disable_locked(stream); - case DRM_XE_PERF_IOCTL_CONFIG: + case DRM_XE_OBSERVATION_IOCTL_CONFIG: return xe_oa_config_locked(stream, arg); - case DRM_XE_PERF_IOCTL_STATUS: + case DRM_XE_OBSERVATION_IOCTL_STATUS: return xe_oa_status_locked(stream, arg); - case DRM_XE_PERF_IOCTL_INFO: + case DRM_XE_OBSERVATION_IOCTL_INFO: return xe_oa_info_locked(stream, arg); } @@ -1209,7 +1209,7 @@ static int xe_oa_release(struct inode *inode, struct file *file) xe_oa_destroy_locked(stream); mutex_unlock(>->oa.gt_lock); - /* Release the reference the perf stream kept on the driver */ + /* Release the reference the OA stream kept on the driver */ drm_dev_put(>_to_xe(gt)->drm); return 0; @@ -1222,7 +1222,7 @@ static int xe_oa_mmap(struct file *file, struct vm_area_struct *vma) unsigned long start = vma->vm_start; int i, ret; - if (xe_perf_stream_paranoid && !perfmon_capable()) { + if (xe_observation_paranoid && !perfmon_capable()) { drm_dbg(&stream->oa->xe->drm, "Insufficient privilege to map OA buffer\n"); return -EACCES; } @@ -1789,8 +1789,8 @@ static int xe_oa_user_extensions(struct xe_oa *oa, u64 extension, int ext_number * @file: @drm_file * * The functions opens an OA stream. 
An OA stream, opened with specified - * properties, enables perf counter samples to be collected, either - * periodically (time based sampling), or on request (using perf queries) + * properties, enables OA counter samples to be collected, either + * periodically (time based sampling), or on request (using OA queries) */ int xe_oa_stream_open_ioctl(struct drm_device *dev, u64 data, struct drm_file *file) { @@ -1836,8 +1836,8 @@ int xe_oa_stream_open_ioctl(struct drm_device *dev, u64 data, struct drm_file *f privileged_op = true; } - if (privileged_op && xe_perf_stream_paranoid && !perfmon_capable()) { - drm_dbg(&oa->xe->drm, "Insufficient privileges to open xe perf stream\n"); + if (privileged_op && xe_observation_paranoid && !perfmon_capable()) { + drm_dbg(&oa->xe->drm, "Insufficient privileges to open xe OA stream\n"); ret = -EACCES; goto err_exec_q; } @@ -2097,7 +2097,7 @@ int xe_oa_add_config_ioctl(struct drm_device *dev, u64 data, struct drm_file *fi return -ENODEV; } - if (xe_perf_stream_paranoid && !perfmon_capable()) { + if (xe_observation_paranoid && !perfmon_capable()) { drm_dbg(&oa->xe->drm, "Insufficient privileges to add xe OA config\n"); return -EACCES; } @@ -2181,7 +2181,7 @@ reg_err: /** * xe_oa_remove_config_ioctl - Removes one OA config * @dev: @drm_device - * @data: pointer to struct @drm_xe_perf_param + * @data: pointer to struct @drm_xe_observation_param * @file: @drm_file */ int xe_oa_remove_config_ioctl(struct drm_device *dev, u64 data, struct drm_file *file) @@ -2197,7 +2197,7 @@ int xe_oa_remove_config_ioctl(struct drm_device *dev, u64 data, struct drm_file return -ENODEV; } - if (xe_perf_stream_paranoid && !perfmon_capable()) { + if (xe_observation_paranoid && !perfmon_capable()) { drm_dbg(&oa->xe->drm, "Insufficient privileges to remove xe OA config\n"); return -EACCES; } @@ -2381,7 +2381,7 @@ static int xe_oa_init_gt(struct xe_gt *gt) /* * Fused off engines can result in oa_unit's with num_engines == 0. These units - * will appear in OA unit query, but no perf streams can be opened on them. + * will appear in OA unit query, but no OA streams can be opened on them. */ gt->oa.num_oa_units = num_oa_units; gt->oa.oa_unit = u; diff --git a/drivers/gpu/drm/xe/xe_observation.c b/drivers/gpu/drm/xe/xe_observation.c new file mode 100644 index 000000000000..fcb584b42a7d --- /dev/null +++ b/drivers/gpu/drm/xe/xe_observation.c @@ -0,0 +1,93 @@ +// SPDX-License-Identifier: MIT +/* + * Copyright © 2023-2024 Intel Corporation + */ + +#include +#include + +#include + +#include "xe_oa.h" +#include "xe_observation.h" + +u32 xe_observation_paranoid = true; +static struct ctl_table_header *sysctl_header; + +static int xe_oa_ioctl(struct drm_device *dev, struct drm_xe_observation_param *arg, + struct drm_file *file) +{ + switch (arg->observation_op) { + case DRM_XE_OBSERVATION_OP_STREAM_OPEN: + return xe_oa_stream_open_ioctl(dev, arg->param, file); + case DRM_XE_OBSERVATION_OP_ADD_CONFIG: + return xe_oa_add_config_ioctl(dev, arg->param, file); + case DRM_XE_OBSERVATION_OP_REMOVE_CONFIG: + return xe_oa_remove_config_ioctl(dev, arg->param, file); + default: + return -EINVAL; + } +} + +/** + * xe_observation_ioctl - The top level observation layer ioctl + * @dev: @drm_device + * @data: pointer to struct @drm_xe_observation_param + * @file: @drm_file + * + * The function is called for different observation stream types and + * allows execution of different operations supported by those stream + * types. + * + * Return: 0 on success or a negative error code on failure.
+ */ +int xe_observation_ioctl(struct drm_device *dev, void *data, struct drm_file *file) +{ + struct drm_xe_observation_param *arg = data; + + if (arg->extensions) + return -EINVAL; + + switch (arg->observation_type) { + case DRM_XE_OBSERVATION_TYPE_OA: + return xe_oa_ioctl(dev, arg, file); + default: + return -EINVAL; + } +} + +static struct ctl_table observation_ctl_table[] = { + { + .procname = "observation_paranoid", + .data = &xe_observation_paranoid, + .maxlen = sizeof(xe_observation_paranoid), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = SYSCTL_ZERO, + .extra2 = SYSCTL_ONE, + }, + {} +}; + +/** + * xe_observation_sysctl_register - Register xe_observation_paranoid sysctl + * + * Normally only superuser/root can access observation stream + * data. However, superuser can set xe_observation_paranoid sysctl to 0 to + * allow non-privileged users to also access observation data. + * + * Return: always returns 0 + */ +int xe_observation_sysctl_register(void) +{ + sysctl_header = register_sysctl("dev/xe", observation_ctl_table); + return 0; +} + +/** + * xe_observation_sysctl_unregister - Unregister xe_observation_paranoid sysctl + */ +void xe_observation_sysctl_unregister(void) +{ + unregister_sysctl_table(sysctl_header); +} diff --git a/drivers/gpu/drm/xe/xe_observation.h b/drivers/gpu/drm/xe/xe_observation.h new file mode 100644 index 000000000000..17816998e966 --- /dev/null +++ b/drivers/gpu/drm/xe/xe_observation.h @@ -0,0 +1,20 @@ +/* SPDX-License-Identifier: MIT */ +/* + * Copyright © 2023-2024 Intel Corporation + */ + +#ifndef _XE_OBSERVATION_H_ +#define _XE_OBSERVATION_H_ + +#include + +struct drm_device; +struct drm_file; + +extern u32 xe_observation_paranoid; + +int xe_observation_ioctl(struct drm_device *dev, void *data, struct drm_file *file); +int xe_observation_sysctl_register(void); +void xe_observation_sysctl_unregister(void); + +#endif diff --git a/drivers/gpu/drm/xe/xe_perf.c b/drivers/gpu/drm/xe/xe_perf.c deleted file mode 100644 index d6cd74cadf34..000000000000 --- a/drivers/gpu/drm/xe/xe_perf.c +++ /dev/null @@ -1,92 +0,0 @@ -// SPDX-License-Identifier: MIT -/* - * Copyright © 2023-2024 Intel Corporation - */ - -#include -#include - -#include - -#include "xe_oa.h" -#include "xe_perf.h" - -u32 xe_perf_stream_paranoid = true; -static struct ctl_table_header *sysctl_header; - -static int xe_oa_ioctl(struct drm_device *dev, struct drm_xe_perf_param *arg, - struct drm_file *file) -{ - switch (arg->perf_op) { - case DRM_XE_PERF_OP_STREAM_OPEN: - return xe_oa_stream_open_ioctl(dev, arg->param, file); - case DRM_XE_PERF_OP_ADD_CONFIG: - return xe_oa_add_config_ioctl(dev, arg->param, file); - case DRM_XE_PERF_OP_REMOVE_CONFIG: - return xe_oa_remove_config_ioctl(dev, arg->param, file); - default: - return -EINVAL; - } -} - -/** - * xe_perf_ioctl - The top level perf layer ioctl - * @dev: @drm_device - * @data: pointer to struct @drm_xe_perf_param - * @file: @drm_file - * - * The function is called for different perf streams types and allows execution - * of different operations supported by those perf stream types. - * - * Return: 0 on success or a negative error code on failure. 
- */ -int xe_perf_ioctl(struct drm_device *dev, void *data, struct drm_file *file) -{ - struct drm_xe_perf_param *arg = data; - - if (arg->extensions) - return -EINVAL; - - switch (arg->perf_type) { - case DRM_XE_PERF_TYPE_OA: - return xe_oa_ioctl(dev, arg, file); - default: - return -EINVAL; - } -} - -static struct ctl_table perf_ctl_table[] = { - { - .procname = "perf_stream_paranoid", - .data = &xe_perf_stream_paranoid, - .maxlen = sizeof(xe_perf_stream_paranoid), - .mode = 0644, - .proc_handler = proc_dointvec_minmax, - .extra1 = SYSCTL_ZERO, - .extra2 = SYSCTL_ONE, - }, - {} -}; - -/** - * xe_perf_sysctl_register - Register "perf_stream_paranoid" sysctl - * - * Normally only superuser/root can access perf counter data. However, - * superuser can set perf_stream_paranoid sysctl to 0 to allow non-privileged - * users to also access perf data. - * - * Return: always returns 0 - */ -int xe_perf_sysctl_register(void) -{ - sysctl_header = register_sysctl("dev/xe", perf_ctl_table); - return 0; -} - -/** - * xe_perf_sysctl_unregister - Unregister "perf_stream_paranoid" sysctl - */ -void xe_perf_sysctl_unregister(void) -{ - unregister_sysctl_table(sysctl_header); -} diff --git a/drivers/gpu/drm/xe/xe_perf.h b/drivers/gpu/drm/xe/xe_perf.h deleted file mode 100644 index 53a8377a1bb1..000000000000 --- a/drivers/gpu/drm/xe/xe_perf.h +++ /dev/null @@ -1,20 +0,0 @@ -/* SPDX-License-Identifier: MIT */ -/* - * Copyright © 2023-2024 Intel Corporation - */ - -#ifndef _XE_PERF_H_ -#define _XE_PERF_H_ - -#include - -struct drm_device; -struct drm_file; - -extern u32 xe_perf_stream_paranoid; - -int xe_perf_ioctl(struct drm_device *dev, void *data, struct drm_file *file); -int xe_perf_sysctl_register(void); -void xe_perf_sysctl_unregister(void); - -#endif diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h index 12eaa8532b5c..33544ef78d3e 100644 --- a/include/uapi/drm/xe_drm.h +++ b/include/uapi/drm/xe_drm.h @@ -80,7 +80,7 @@ extern "C" { * - &DRM_IOCTL_XE_EXEC_QUEUE_GET_PROPERTY * - &DRM_IOCTL_XE_EXEC * - &DRM_IOCTL_XE_WAIT_USER_FENCE - * - &DRM_IOCTL_XE_PERF + * - &DRM_IOCTL_XE_OBSERVATION */ /* @@ -101,7 +101,7 @@ extern "C" { #define DRM_XE_EXEC_QUEUE_GET_PROPERTY 0x08 #define DRM_XE_EXEC 0x09 #define DRM_XE_WAIT_USER_FENCE 0x0a -#define DRM_XE_PERF 0x0b +#define DRM_XE_OBSERVATION 0x0b /* Must be kept compact -- no holes */ @@ -116,7 +116,7 @@ extern "C" { #define DRM_IOCTL_XE_EXEC_QUEUE_GET_PROPERTY DRM_IOWR(DRM_COMMAND_BASE + DRM_XE_EXEC_QUEUE_GET_PROPERTY, struct drm_xe_exec_queue_get_property) #define DRM_IOCTL_XE_EXEC DRM_IOW(DRM_COMMAND_BASE + DRM_XE_EXEC, struct drm_xe_exec) #define DRM_IOCTL_XE_WAIT_USER_FENCE DRM_IOWR(DRM_COMMAND_BASE + DRM_XE_WAIT_USER_FENCE, struct drm_xe_wait_user_fence) -#define DRM_IOCTL_XE_PERF DRM_IOW(DRM_COMMAND_BASE + DRM_XE_PERF, struct drm_xe_perf_param) +#define DRM_IOCTL_XE_OBSERVATION DRM_IOW(DRM_COMMAND_BASE + DRM_XE_OBSERVATION, struct drm_xe_observation_param) /** * DOC: Xe IOCTL Extensions @@ -1376,66 +1376,67 @@ struct drm_xe_wait_user_fence { }; /** - * enum drm_xe_perf_type - Perf stream types + * enum drm_xe_observation_type - Observation stream types */ -enum drm_xe_perf_type { - /** @DRM_XE_PERF_TYPE_OA: OA perf stream type */ - DRM_XE_PERF_TYPE_OA, +enum drm_xe_observation_type { + /** @DRM_XE_OBSERVATION_TYPE_OA: OA observation stream type */ + DRM_XE_OBSERVATION_TYPE_OA, }; /** - * enum drm_xe_perf_op - Perf stream ops + * enum drm_xe_observation_op - Observation stream ops */ -enum drm_xe_perf_op { - /** @DRM_XE_PERF_OP_STREAM_OPEN: 
Open a perf counter stream */ - DRM_XE_PERF_OP_STREAM_OPEN, +enum drm_xe_observation_op { + /** @DRM_XE_OBSERVATION_OP_STREAM_OPEN: Open an observation stream */ + DRM_XE_OBSERVATION_OP_STREAM_OPEN, - /** @DRM_XE_PERF_OP_ADD_CONFIG: Add perf stream config */ - DRM_XE_PERF_OP_ADD_CONFIG, + /** @DRM_XE_OBSERVATION_OP_ADD_CONFIG: Add observation stream config */ + DRM_XE_OBSERVATION_OP_ADD_CONFIG, - /** @DRM_XE_PERF_OP_REMOVE_CONFIG: Remove perf stream config */ - DRM_XE_PERF_OP_REMOVE_CONFIG, + /** @DRM_XE_OBSERVATION_OP_REMOVE_CONFIG: Remove observation stream config */ + DRM_XE_OBSERVATION_OP_REMOVE_CONFIG, }; /** - * struct drm_xe_perf_param - Input of &DRM_XE_PERF + * struct drm_xe_observation_param - Input of &DRM_XE_OBSERVATION * - * The perf layer enables multiplexing perf counter streams of multiple - * types. The actual params for a particular stream operation are supplied - * via the @param pointer (use __copy_from_user to get these params). + * The observation layer enables multiplexing observation streams of + * multiple types. The actual params for a particular stream operation are + * supplied via the @param pointer (use __copy_from_user to get these + * params). */ -struct drm_xe_perf_param { +struct drm_xe_observation_param { /** @extensions: Pointer to the first extension struct, if any */ __u64 extensions; - /** @perf_type: Perf stream type, of enum @drm_xe_perf_type */ - __u64 perf_type; - /** @perf_op: Perf op, of enum @drm_xe_perf_op */ - __u64 perf_op; + /** @observation_type: observation stream type, of enum @drm_xe_observation_type */ + __u64 observation_type; + /** @observation_op: observation stream op, of enum @drm_xe_observation_op */ + __u64 observation_op; /** @param: Pointer to actual stream params */ __u64 param; }; /** - * enum drm_xe_perf_ioctls - Perf fd ioctl's + * enum drm_xe_observation_ioctls - Observation stream fd ioctl's * - * Information exchanged between userspace and kernel for perf fd ioctl's - * is stream type specific + * Information exchanged between userspace and kernel for observation fd + * ioctl's is stream type specific */ -enum drm_xe_perf_ioctls { - /** @DRM_XE_PERF_IOCTL_ENABLE: Enable data capture for a stream */ - DRM_XE_PERF_IOCTL_ENABLE = _IO('i', 0x0), +enum drm_xe_observation_ioctls { + /** @DRM_XE_OBSERVATION_IOCTL_ENABLE: Enable data capture for an observation stream */ + DRM_XE_OBSERVATION_IOCTL_ENABLE = _IO('i', 0x0), - /** @DRM_XE_PERF_IOCTL_DISABLE: Disable data capture for a stream */ - DRM_XE_PERF_IOCTL_DISABLE = _IO('i', 0x1), + /** @DRM_XE_OBSERVATION_IOCTL_DISABLE: Disable data capture for an observation stream */ + DRM_XE_OBSERVATION_IOCTL_DISABLE = _IO('i', 0x1), - /** @DRM_XE_PERF_IOCTL_CONFIG: Change stream configuration */ - DRM_XE_PERF_IOCTL_CONFIG = _IO('i', 0x2), + /** @DRM_XE_OBSERVATION_IOCTL_CONFIG: Change observation stream configuration */ + DRM_XE_OBSERVATION_IOCTL_CONFIG = _IO('i', 0x2), - /** @DRM_XE_PERF_IOCTL_STATUS: Return stream status */ - DRM_XE_PERF_IOCTL_STATUS = _IO('i', 0x3), + /** @DRM_XE_OBSERVATION_IOCTL_STATUS: Return observation stream status */ + DRM_XE_OBSERVATION_IOCTL_STATUS = _IO('i', 0x3), - /** @DRM_XE_PERF_IOCTL_INFO: Return stream info */ - DRM_XE_PERF_IOCTL_INFO = _IO('i', 0x4), + /** @DRM_XE_OBSERVATION_IOCTL_INFO: Return observation stream info */ + DRM_XE_OBSERVATION_IOCTL_INFO = _IO('i', 0x4), }; /** @@ -1546,12 +1547,12 @@ enum drm_xe_oa_format_type { * Stream params are specified as a chain of @drm_xe_ext_set_property * struct's, with @property values from enum
@drm_xe_oa_property_id and * @drm_xe_user_extension base.name set to @DRM_XE_OA_EXTENSION_SET_PROPERTY. - * @param field in struct @drm_xe_perf_param points to the first + * @param field in struct @drm_xe_observation_param points to the first * @drm_xe_ext_set_property struct. * - * Exactly the same mechanism is also used for stream reconfiguration using - * the @DRM_XE_PERF_IOCTL_CONFIG perf fd ioctl, though only a subset of - * properties below can be specified for stream reconfiguration. + * Exactly the same mechanism is also used for stream reconfiguration using the + * @DRM_XE_OBSERVATION_IOCTL_CONFIG observation stream fd ioctl, though only a + * subset of properties below can be specified for stream reconfiguration. */ enum drm_xe_oa_property_id { #define DRM_XE_OA_EXTENSION_SET_PROPERTY 0 @@ -1571,11 +1572,11 @@ enum drm_xe_oa_property_id { /** * @DRM_XE_OA_PROPERTY_OA_METRIC_SET: OA metrics defining contents of OA - * reports, previously added via @DRM_XE_PERF_OP_ADD_CONFIG. + * reports, previously added via @DRM_XE_OBSERVATION_OP_ADD_CONFIG. */ DRM_XE_OA_PROPERTY_OA_METRIC_SET, - /** @DRM_XE_OA_PROPERTY_OA_FORMAT: Perf counter report format */ + /** @DRM_XE_OA_PROPERTY_OA_FORMAT: OA counter report format */ DRM_XE_OA_PROPERTY_OA_FORMAT, /* * OA_FORMAT's are specified the same way as in PRM/Bspec 52198/60942, @@ -1596,13 +1597,13 @@ enum drm_xe_oa_property_id { /** * @DRM_XE_OA_PROPERTY_OA_DISABLED: A value of 1 will open the OA - * stream in a DISABLED state (see @DRM_XE_PERF_IOCTL_ENABLE). + * stream in a DISABLED state (see @DRM_XE_OBSERVATION_IOCTL_ENABLE). */ DRM_XE_OA_PROPERTY_OA_DISABLED, /** * @DRM_XE_OA_PROPERTY_EXEC_QUEUE_ID: Open the stream for a specific - * @exec_queue_id. Perf queries can be executed on this exec queue. + * @exec_queue_id. OA queries can be executed on this exec queue. */ DRM_XE_OA_PROPERTY_EXEC_QUEUE_ID, @@ -1622,7 +1623,7 @@ enum drm_xe_oa_property_id { /** * struct drm_xe_oa_config - OA metric configuration * - * Multiple OA configs can be added using @DRM_XE_PERF_OP_ADD_CONFIG. A + * Multiple OA configs can be added using @DRM_XE_OBSERVATION_OP_ADD_CONFIG. A * particular config can be specified when opening an OA stream using * @DRM_XE_OA_PROPERTY_OA_METRIC_SET property. */ @@ -1645,8 +1646,9 @@ struct drm_xe_oa_config { /** * struct drm_xe_oa_stream_status - OA stream status returned from - * @DRM_XE_PERF_IOCTL_STATUS perf fd ioctl. Userspace can call the ioctl to - * query stream status in response to EIO errno from perf fd read(). + * @DRM_XE_OBSERVATION_IOCTL_STATUS observation stream fd ioctl. Userspace can + * call the ioctl to query stream status in response to EIO errno from + * observation fd read(). 
*/ struct drm_xe_oa_stream_status { /** @extensions: Pointer to the first extension struct, if any */ @@ -1665,7 +1667,7 @@ struct drm_xe_oa_stream_status { /** * struct drm_xe_oa_stream_info - OA stream info returned from - * @DRM_XE_PERF_IOCTL_INFO perf fd ioctl + * @DRM_XE_OBSERVATION_IOCTL_INFO observation stream fd ioctl */ struct drm_xe_oa_stream_info { /** @extensions: Pointer to the first extension struct, if any */ -- cgit v1.2.3 From 01e0cfc994be484ddcb9e121e353e51d8bb837c0 Mon Sep 17 00:00:00 2001 From: Thomas Hellström Date: Fri, 5 Jul 2024 15:28:28 +0200 Subject: drm/xe: Use write-back caching mode for system memory on DGFX MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The caching mode for buffer objects with VRAM as a possible placement was forced to write-combined, regardless of placement. However, write-combined system memory is expensive to allocate and even though it is pooled, the pool is expensive to shrink, since it involves global CPU TLB flushes. Moreover write-combined system memory from TTM is only reliably available on x86 and DGFX doesn't have an x86 restriction. So regardless of the cpu caching mode selected for a bo, internally use write-back caching mode for system memory on DGFX. Coherency is maintained, but user-space clients may perceive a difference in cpu access speeds. v2: - Update RB- and Ack tags. - Rephrase wording in xe_drm.h (Matt Roper) v3: - Really rephrase wording. Signed-off-by: Thomas Hellström Fixes: 622f709ca629 ("drm/xe/uapi: Add support for CPU caching mode") Cc: Pallavi Mishra Cc: Matthew Auld Cc: dri-devel@lists.freedesktop.org Cc: Joonas Lahtinen Cc: Effie Yu Cc: Matthew Brost Cc: Maarten Lankhorst Cc: Jose Souza Cc: Michal Mrozek Cc: # v6.8+ Acked-by: Matthew Auld Acked-by: José Roberto de Souza Reviewed-by: Rodrigo Vivi Fixes: 622f709ca629 ("drm/xe/uapi: Add support for CPU caching mode") Acked-by: Michal Mrozek Acked-by: Effie Yu #On chat Link: https://patchwork.freedesktop.org/patch/msgid/20240705132828.27714-1-thomas.hellstrom@linux.intel.com --- drivers/gpu/drm/xe/xe_bo.c | 47 ++++++++++++++++++++++++---------------- drivers/gpu/drm/xe/xe_bo_types.h | 3 ++- include/uapi/drm/xe_drm.h | 8 ++++++- 3 files changed, 37 insertions(+), 21 deletions(-) (limited to 'include/uapi') diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c index 65c696966e96..31192d983d9e 100644 --- a/drivers/gpu/drm/xe/xe_bo.c +++ b/drivers/gpu/drm/xe/xe_bo.c @@ -343,7 +343,7 @@ static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo, struct xe_device *xe = xe_bo_device(bo); struct xe_ttm_tt *tt; unsigned long extra_pages; - enum ttm_caching caching; + enum ttm_caching caching = ttm_cached; int err; tt = kzalloc(sizeof(*tt), GFP_KERNEL); @@ -357,26 +357,35 @@ static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo, extra_pages = DIV_ROUND_UP(xe_device_ccs_bytes(xe, bo->size), PAGE_SIZE); - switch (bo->cpu_caching) { - case DRM_XE_GEM_CPU_CACHING_WC: - caching = ttm_write_combined; - break; - default: - caching = ttm_cached; - break; - } - - WARN_ON((bo->flags & XE_BO_FLAG_USER) && !bo->cpu_caching); - /* - * Display scanout is always non-coherent with the CPU cache. - * - * For Xe_LPG and beyond, PPGTT PTE lookups are also non-coherent and - * require a CPU:WC mapping. + * DGFX system memory is always WB / ttm_cached, since + * other caching modes are only supported on x86. DGFX + * GPU system memory accesses are always coherent with the + * CPU. 
*/ - if ((!bo->cpu_caching && bo->flags & XE_BO_FLAG_SCANOUT) || - (xe->info.graphics_verx100 >= 1270 && bo->flags & XE_BO_FLAG_PAGETABLE)) - caching = ttm_write_combined; + if (!IS_DGFX(xe)) { + switch (bo->cpu_caching) { + case DRM_XE_GEM_CPU_CACHING_WC: + caching = ttm_write_combined; + break; + default: + caching = ttm_cached; + break; + } + + WARN_ON((bo->flags & XE_BO_FLAG_USER) && !bo->cpu_caching); + + /* + * Display scanout is always non-coherent with the CPU cache. + * + * For Xe_LPG and beyond, PPGTT PTE lookups are also + * non-coherent and require a CPU:WC mapping. + */ + if ((!bo->cpu_caching && bo->flags & XE_BO_FLAG_SCANOUT) || + (xe->info.graphics_verx100 >= 1270 && + bo->flags & XE_BO_FLAG_PAGETABLE)) + caching = ttm_write_combined; + } if (bo->flags & XE_BO_FLAG_NEEDS_UC) { /* diff --git a/drivers/gpu/drm/xe/xe_bo_types.h b/drivers/gpu/drm/xe/xe_bo_types.h index 02d68873558a..ebc8abf7930a 100644 --- a/drivers/gpu/drm/xe/xe_bo_types.h +++ b/drivers/gpu/drm/xe/xe_bo_types.h @@ -68,7 +68,8 @@ struct xe_bo { /** * @cpu_caching: CPU caching mode. Currently only used for userspace - * objects. + * objects. Exceptions are system memory on DGFX, which is always + * WB. */ u16 cpu_caching; diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h index 33544ef78d3e..19619d4952a8 100644 --- a/include/uapi/drm/xe_drm.h +++ b/include/uapi/drm/xe_drm.h @@ -783,7 +783,13 @@ struct drm_xe_gem_create { #define DRM_XE_GEM_CPU_CACHING_WC 2 /** * @cpu_caching: The CPU caching mode to select for this object. If - * mmaping the object the mode selected here will also be used. + * mmaping the object the mode selected here will also be used. The + * exception is when mapping system memory (including data evicted + * to system) on discrete GPUs. The caching mode selected will + * then be overridden to DRM_XE_GEM_CPU_CACHING_WB, and coherency + * between GPU- and CPU is guaranteed. The caching mode of + * existing CPU-mappings will be updated transparently to + * user-space clients. */ __u16 cpu_caching; /** @pad: MBZ */ -- cgit v1.2.3 From 76299a557f36d624ca32500173ad7856e1ad93c0 Mon Sep 17 00:00:00 2001 From: Mario Limonciello Date: Wed, 3 Jul 2024 00:17:21 -0500 Subject: drm: Introduce 'power saving policy' drm property The `power saving policy` DRM property is an optional property that can be added to a connector by a driver. This property is for compositors to indicate intent of policy of whether a driver can use power saving features that may compromise the experience intended by the compositor. 
Acked-by: Leo Li Signed-off-by: Mario Limonciello Signed-off-by: Hamza Mahfooz Link: https://patchwork.freedesktop.org/patch/msgid/20240703051722.328-2-mario.limonciello@amd.com --- drivers/gpu/drm/drm_connector.c | 48 +++++++++++++++++++++++++++++++++++++++++ include/drm/drm_connector.h | 2 ++ include/drm/drm_mode_config.h | 5 +++++ include/uapi/drm/drm_mode.h | 7 ++++++ 4 files changed, 62 insertions(+) (limited to 'include/uapi') diff --git a/drivers/gpu/drm/drm_connector.c b/drivers/gpu/drm/drm_connector.c index b4f4d2f908d1..7c44e3a1d8e0 100644 --- a/drivers/gpu/drm/drm_connector.c +++ b/drivers/gpu/drm/drm_connector.c @@ -1043,6 +1043,11 @@ static const struct drm_prop_enum_list drm_scaling_mode_enum_list[] = { { DRM_MODE_SCALE_ASPECT, "Full aspect" }, }; +static const struct drm_prop_enum_list drm_power_saving_policy_enum_list[] = { + { __builtin_ffs(DRM_MODE_REQUIRE_COLOR_ACCURACY) - 1, "Require color accuracy" }, + { __builtin_ffs(DRM_MODE_REQUIRE_LOW_LATENCY) - 1, "Require low latency" }, +}; + static const struct drm_prop_enum_list drm_aspect_ratio_enum_list[] = { { DRM_MODE_PICTURE_ASPECT_NONE, "Automatic" }, { DRM_MODE_PICTURE_ASPECT_4_3, "4:3" }, @@ -1629,6 +1634,16 @@ EXPORT_SYMBOL(drm_hdmi_connector_get_output_format_name); * * Drivers can set up these properties by calling * drm_mode_create_tv_margin_properties(). + * power saving policy: + * This property is used to set the power saving policy for the connector. + * This property is populated with a bitmask of optional requirements set + * by the drm master for the drm driver to respect: + * - "Require color accuracy": Disable power saving features that will + * affect color fidelity. + * For example: Hardware assisted backlight modulation. + * - "Require low latency": Disable power saving features that will + * affect latency. + * For example: Panel self refresh (PSR) */ int drm_connector_create_standard_properties(struct drm_device *dev) @@ -2131,6 +2146,39 @@ int drm_mode_create_scaling_mode_property(struct drm_device *dev) } EXPORT_SYMBOL(drm_mode_create_scaling_mode_property); +/** + * drm_mode_create_power_saving_policy_property - create power saving policy property + * @dev: DRM device + * @supported_policies: bitmask of supported power saving policies + * + * Called by a driver the first time it's needed, must be attached to desired + * connectors. 
+ * + * Returns: %0 + */ +int drm_mode_create_power_saving_policy_property(struct drm_device *dev, + uint64_t supported_policies) +{ + struct drm_property *power_saving; + + if (dev->mode_config.power_saving_policy) + return 0; + WARN_ON((supported_policies & DRM_MODE_POWER_SAVING_POLICY_ALL) == 0); + + power_saving = + drm_property_create_bitmask(dev, 0, "power saving policy", + drm_power_saving_policy_enum_list, + ARRAY_SIZE(drm_power_saving_policy_enum_list), + supported_policies); + if (!power_saving) + return -ENOMEM; + + dev->mode_config.power_saving_policy = power_saving; + + return 0; +} +EXPORT_SYMBOL(drm_mode_create_power_saving_policy_property); + /** * DOC: Variable refresh properties * diff --git a/include/drm/drm_connector.h b/include/drm/drm_connector.h index e3fa43291f44..5ad735253413 100644 --- a/include/drm/drm_connector.h +++ b/include/drm/drm_connector.h @@ -2267,6 +2267,8 @@ int drm_mode_create_dp_colorspace_property(struct drm_connector *connector, u32 supported_colorspaces); int drm_mode_create_content_type_property(struct drm_device *dev); int drm_mode_create_suggested_offset_properties(struct drm_device *dev); +int drm_mode_create_power_saving_policy_property(struct drm_device *dev, + uint64_t supported_policies); int drm_connector_set_path_property(struct drm_connector *connector, const char *path); diff --git a/include/drm/drm_mode_config.h b/include/drm/drm_mode_config.h index ab0f167474b1..150f9a3b649f 100644 --- a/include/drm/drm_mode_config.h +++ b/include/drm/drm_mode_config.h @@ -969,6 +969,11 @@ struct drm_mode_config { */ struct drm_atomic_state *suspend_state; + /** + * @power_saving_policy: bitmask for power saving policy requests. + */ + struct drm_property *power_saving_policy; + const struct drm_mode_config_helper_funcs *helper_private; }; diff --git a/include/uapi/drm/drm_mode.h b/include/uapi/drm/drm_mode.h index d390011b89b4..880303c2ad97 100644 --- a/include/uapi/drm/drm_mode.h +++ b/include/uapi/drm/drm_mode.h @@ -152,6 +152,13 @@ extern "C" { #define DRM_MODE_SCALE_CENTER 2 /* Centered, no scaling */ #define DRM_MODE_SCALE_ASPECT 3 /* Full screen, preserve aspect */ +/* power saving policy options */ +#define DRM_MODE_REQUIRE_COLOR_ACCURACY BIT(0) /* Compositor requires color accuracy */ +#define DRM_MODE_REQUIRE_LOW_LATENCY BIT(1) /* Compositor requires low latency */ + +#define DRM_MODE_POWER_SAVING_POLICY_ALL (DRM_MODE_REQUIRE_COLOR_ACCURACY |\ + DRM_MODE_REQUIRE_LOW_LATENCY) + /* Dithering mode options */ #define DRM_MODE_DITHERING_OFF 0 #define DRM_MODE_DITHERING_ON 1 -- cgit v1.2.3 From 7108b4a589cd6d3a2c1276fd610b3500f46de66a Mon Sep 17 00:00:00 2001 From: Lucas De Marchi Date: Wed, 10 Jul 2024 15:02:27 -0700 Subject: drm/xe/uapi: Expose SIMD16 EU mask in topology query MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit PVC, Xe2 and later platforms have 16-wide EUs. We were implicitly reporting for PVC the number of 16-wide EUs without giving userspace any hint that they were different than for other platforms. Xe2 and later also have 16-wide, but in those cases the reported number would correspond to the 8-wide count. To avoid confusion and make sure the right number is used by userspace depending on the platform, add a new item to the topology query and drop the one that is not available. The new mask reported for both PVC and Xe2 should now match the numbers reported via hwconfig. 
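For illustration (not part of the patch), user space walking the topology blob should now accept either EU item. A hedged sketch using the usual two-call DRM_IOCTL_XE_DEVICE_QUERY pattern (first call sizes the buffer, second fills it); the num_bytes/mask layout is assumed from the existing drm_xe_query_topology_mask struct:

    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <drm/xe_drm.h>

    /* Count EUs per DSS on one GT, accepting the SIMD8 or the new
     * SIMD16 item (at most one of the two is reported per device).
     */
    static int count_eus_per_dss(int fd, uint16_t gt_id)
    {
    	struct drm_xe_device_query q = { .query = DRM_XE_DEVICE_QUERY_GT_TOPOLOGY };
    	char *buf;
    	int eus = 0;

    	if (ioctl(fd, DRM_IOCTL_XE_DEVICE_QUERY, &q))	/* 1st call: size only */
    		return -1;
    	buf = malloc(q.size);
    	if (!buf)
    		return -1;
    	q.data = (uintptr_t)buf;
    	if (ioctl(fd, DRM_IOCTL_XE_DEVICE_QUERY, &q)) {	/* 2nd call: data */
    		free(buf);
    		return -1;
    	}

    	for (uint32_t off = 0; off < q.size; ) {
    		struct drm_xe_query_topology_mask *topo = (void *)(buf + off);

    		if (topo->gt_id == gt_id &&
    		    (topo->type == DRM_XE_TOPO_EU_PER_DSS ||
    		     topo->type == DRM_XE_TOPO_SIMD16_EU_PER_DSS))
    			for (uint32_t i = 0; i < topo->num_bytes; i++)
    				eus += __builtin_popcount(topo->mask[i]);
    		off += sizeof(*topo) + topo->num_bytes;
    	}
    	free(buf);
    	return eus;
    }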
v2: Use a different topo item with EU type in its name to report the new mask instead of adding the type itself as the item (Matt Roper) Reviewed-by: Matt Roper Acked-by: José Roberto de Souza Acked-by: Mateusz Jablonski Acked-by: Wenbin Lu Acked-by: Effie Yu Acked-by: Rodrigo Vivi Link: https://patchwork.freedesktop.org/patch/msgid/20240710220446.2169797-1-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi --- drivers/gpu/drm/xe/xe_gt_topology.c | 27 ++++++++++++++++++++++----- drivers/gpu/drm/xe/xe_gt_types.h | 11 +++++++++++ drivers/gpu/drm/xe/xe_query.c | 4 +++- include/uapi/drm/xe_drm.h | 10 +++++++++- 4 files changed, 45 insertions(+), 7 deletions(-) (limited to 'include/uapi') diff --git a/drivers/gpu/drm/xe/xe_gt_topology.c b/drivers/gpu/drm/xe/xe_gt_topology.c index 25ff03ab8448..5a1559edf3e9 100644 --- a/drivers/gpu/drm/xe/xe_gt_topology.c +++ b/drivers/gpu/drm/xe/xe_gt_topology.c @@ -6,6 +6,7 @@ #include "xe_gt_topology.h" #include +#include #include "regs/xe_gt_regs.h" #include "xe_assert.h" @@ -31,7 +32,7 @@ load_dss_mask(struct xe_gt *gt, xe_dss_mask_t mask, int numregs, ...) } static void -load_eu_mask(struct xe_gt *gt, xe_eu_mask_t mask) +load_eu_mask(struct xe_gt *gt, xe_eu_mask_t mask, enum xe_gt_eu_type *eu_type) { struct xe_device *xe = gt_to_xe(gt); u32 reg_val = xe_mmio_read32(gt, XELP_EU_ENABLE); @@ -47,11 +48,13 @@ load_eu_mask(struct xe_gt *gt, xe_eu_mask_t mask) if (GRAPHICS_VERx100(xe) < 1250) reg_val = ~reg_val & XELP_EU_MASK; - /* On PVC, one bit = one EU */ - if (GRAPHICS_VERx100(xe) == 1260) { + if (GRAPHICS_VERx100(xe) == 1260 || GRAPHICS_VER(xe) >= 20) { + /* SIMD16 EUs, one bit == one EU */ + *eu_type = XE_GT_EU_TYPE_SIMD16; val = reg_val; } else { - /* All other platforms, one bit = 2 EU */ + /* SIMD8 EUs, one bit == 2 EU */ + *eu_type = XE_GT_EU_TYPE_SIMD8; for (i = 0; i < fls(reg_val); i++) if (reg_val & BIT(i)) val |= 0x3 << 2 * i; @@ -213,7 +216,7 @@ xe_gt_topology_init(struct xe_gt *gt) XEHP_GT_COMPUTE_DSS_ENABLE, XEHPC_GT_COMPUTE_DSS_ENABLE_EXT, XE2_GT_COMPUTE_DSS_2); - load_eu_mask(gt, gt->fuse_topo.eu_mask_per_dss); + load_eu_mask(gt, gt->fuse_topo.eu_mask_per_dss, >->fuse_topo.eu_type); load_l3_bank_mask(gt, gt->fuse_topo.l3_bank_mask); p = drm_dbg_printer(>_to_xe(gt)->drm, DRM_UT_DRIVER, "GT topology"); @@ -221,6 +224,18 @@ xe_gt_topology_init(struct xe_gt *gt) xe_gt_topology_dump(gt, &p); } +static const char *eu_type_to_str(enum xe_gt_eu_type eu_type) +{ + switch (eu_type) { + case XE_GT_EU_TYPE_SIMD16: + return "simd16"; + case XE_GT_EU_TYPE_SIMD8: + return "simd8"; + } + + unreachable(); +} + void xe_gt_topology_dump(struct xe_gt *gt, struct drm_printer *p) { @@ -231,6 +246,8 @@ xe_gt_topology_dump(struct xe_gt *gt, struct drm_printer *p) drm_printf(p, "EU mask per DSS: %*pb\n", XE_MAX_EU_FUSE_BITS, gt->fuse_topo.eu_mask_per_dss); + drm_printf(p, "EU type: %s\n", + eu_type_to_str(gt->fuse_topo.eu_type)); drm_printf(p, "L3 bank mask: %*pb\n", XE_MAX_L3_BANK_MASK_BITS, gt->fuse_topo.l3_bank_mask); diff --git a/drivers/gpu/drm/xe/xe_gt_types.h b/drivers/gpu/drm/xe/xe_gt_types.h index 38a0d0e178c8..ef68c4a92972 100644 --- a/drivers/gpu/drm/xe/xe_gt_types.h +++ b/drivers/gpu/drm/xe/xe_gt_types.h @@ -27,6 +27,11 @@ enum xe_gt_type { XE_GT_TYPE_MEDIA, }; +enum xe_gt_eu_type { + XE_GT_EU_TYPE_SIMD8, + XE_GT_EU_TYPE_SIMD16, +}; + #define XE_MAX_DSS_FUSE_REGS 3 #define XE_MAX_DSS_FUSE_BITS (32 * XE_MAX_DSS_FUSE_REGS) #define XE_MAX_EU_FUSE_REGS 1 @@ -343,6 +348,12 @@ struct xe_gt { /** @fuse_topo.l3_bank_mask: L3 bank mask */ xe_l3_bank_mask_t 
l3_bank_mask; + + /** + * @fuse_topo.eu_type: type/width of EU stored in + * fuse_topo.eu_mask_per_dss + */ + enum xe_gt_eu_type eu_type; } fuse_topo; /** @steering: register steering for individual HW units */ diff --git a/drivers/gpu/drm/xe/xe_query.c b/drivers/gpu/drm/xe/xe_query.c index 4e01df6b1b7a..73ef6e4c2dc9 100644 --- a/drivers/gpu/drm/xe/xe_query.c +++ b/drivers/gpu/drm/xe/xe_query.c @@ -518,7 +518,9 @@ static int query_gt_topology(struct xe_device *xe, if (err) return err; - topo.type = DRM_XE_TOPO_EU_PER_DSS; + topo.type = gt->fuse_topo.eu_type == XE_GT_EU_TYPE_SIMD16 ? + DRM_XE_TOPO_SIMD16_EU_PER_DSS : + DRM_XE_TOPO_EU_PER_DSS; err = copy_mask(&query_ptr, &topo, gt->fuse_topo.eu_mask_per_dss, sizeof(gt->fuse_topo.eu_mask_per_dss)); diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h index 19619d4952a8..29425d7fdc77 100644 --- a/include/uapi/drm/xe_drm.h +++ b/include/uapi/drm/xe_drm.h @@ -517,7 +517,14 @@ struct drm_xe_query_gt_list { * available per Dual Sub Slices (DSS). For example a query response * containing the following in mask: * ``EU_PER_DSS ff ff 00 00 00 00 00 00`` - * means each DSS has 16 EU. + * means each DSS has 16 SIMD8 EUs. This type may be omitted if device + * doesn't have SIMD8 EUs. + * - %DRM_XE_TOPO_SIMD16_EU_PER_DSS - To query the mask of SIMD16 Execution + * Units (EU) available per Dual Sub Slices (DSS). For example a query + * response containing the following in mask: + * ``SIMD16_EU_PER_DSS ff ff 00 00 00 00 00 00`` + * means each DSS has 16 SIMD16 EUs. This type may be omitted if device + * doesn't have SIMD16 EUs. */ struct drm_xe_query_topology_mask { /** @gt_id: GT ID the mask is associated with */ @@ -527,6 +534,7 @@ struct drm_xe_query_topology_mask { #define DRM_XE_TOPO_DSS_COMPUTE 2 #define DRM_XE_TOPO_L3_BANK 3 #define DRM_XE_TOPO_EU_PER_DSS 4 +#define DRM_XE_TOPO_SIMD16_EU_PER_DSS 5 /** @type: type of mask */ __u16 type; -- cgit v1.2.3 From 7214da0ed2220a2b9ad22aa77a5974cdd2a62799 Mon Sep 17 00:00:00 2001 From: Dmitry Osipenko Date: Sun, 14 Jul 2024 23:55:02 +0300 Subject: drm/virtio: Add DRM capset definition Define DRM native context capset in the VirtIO-GPU protocol header. Signed-off-by: Dmitry Osipenko Reviewed-by: Rob Clark Reviewed-by: Pierre-Eric Pelloux-Prayer Link: https://patchwork.freedesktop.org/patch/msgid/20240714205502.3409718-1-dmitry.osipenko@collabora.com --- include/uapi/linux/virtio_gpu.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/virtio_gpu.h b/include/uapi/linux/virtio_gpu.h index 0e21f3998108..bf2c9cabd207 100644 --- a/include/uapi/linux/virtio_gpu.h +++ b/include/uapi/linux/virtio_gpu.h @@ -311,6 +311,7 @@ struct virtio_gpu_cmd_submit { #define VIRTIO_GPU_CAPSET_VIRGL2 2 /* 3 is reserved for gfxstream */ #define VIRTIO_GPU_CAPSET_VENUS 4 +#define VIRTIO_GPU_CAPSET_DRM 6 /* VIRTIO_GPU_CMD_GET_CAPSET_INFO */ struct virtio_gpu_get_capset_info { -- cgit v1.2.3 From e06b71b2313a00579ba64a1cc43ad29d64cb8d4c Mon Sep 17 00:00:00 2001 From: Jonathan Kim Date: Tue, 21 May 2024 13:22:15 -0400 Subject: drm/amdkfd: allow users to target recommended SDMA engines Certain GPUs have better copy performance over xGMI on specific SDMA engines depending on the source and destination GPU. Allow users to create SDMA queues on these recommended engines. Close to 2x overall performance has been observed with this optimization. 
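For illustration (not part of the patch), the intended user flow is: read the new recommended_sdma_engine_id_mask sysfs property for the peer-to-peer iolink, then pass one of its set bits as the target engine when creating the queue. A hedged, compile-level sketch; real queue creation also needs ring-buffer and pointer setup that is elided here:

    #include <stdint.h>
    #include <strings.h>
    #include <sys/ioctl.h>
    #include <linux/kfd_ioctl.h>

    /* Create an SDMA queue on a recommended engine. 'rec_mask' is the
     * value read from the iolink's recommended_sdma_engine_id_mask
     * sysfs file; 'args' must already carry valid ring/pointer fields.
     */
    static int create_sdma_queue_on_rec_engine(int kfd_fd, uint32_t gpu_id,
    					   uint32_t rec_mask,
    					   struct kfd_ioctl_create_queue_args *args)
    {
    	args->gpu_id = gpu_id;
    	args->queue_type = KFD_IOC_QUEUE_TYPE_SDMA_BY_ENG_ID;
    	args->sdma_engine_id = ffs(rec_mask) - 1;	/* lowest recommended engine */
    	args->queue_percentage = 100;

    	return ioctl(kfd_fd, AMDKFD_IOC_CREATE_QUEUE, args);
    }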
Signed-off-by: Jonathan Kim Reviewed-by: Felix Kuehling Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 16 +++++++ .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 38 +++++++++++++++- drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 5 ++- .../gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 1 + drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 52 ++++++++++++++++++++++ drivers/gpu/drm/amd/amdkfd/kfd_topology.h | 1 + include/uapi/linux/kfd_ioctl.h | 6 ++- 7 files changed, 116 insertions(+), 3 deletions(-) (limited to 'include/uapi') diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c index 65a37ac5a0f0..0622ebd7e8ef 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c @@ -255,6 +255,7 @@ static int set_queue_properties_from_user(struct queue_properties *q_properties, args->ctx_save_restore_address; q_properties->ctx_save_restore_area_size = args->ctx_save_restore_size; q_properties->ctl_stack_size = args->ctl_stack_size; + q_properties->sdma_engine_id = args->sdma_engine_id; if (args->queue_type == KFD_IOC_QUEUE_TYPE_COMPUTE || args->queue_type == KFD_IOC_QUEUE_TYPE_COMPUTE_AQL) q_properties->type = KFD_QUEUE_TYPE_COMPUTE; @@ -262,6 +263,8 @@ static int set_queue_properties_from_user(struct queue_properties *q_properties, q_properties->type = KFD_QUEUE_TYPE_SDMA; else if (args->queue_type == KFD_IOC_QUEUE_TYPE_SDMA_XGMI) q_properties->type = KFD_QUEUE_TYPE_SDMA_XGMI; + else if (args->queue_type == KFD_IOC_QUEUE_TYPE_SDMA_BY_ENG_ID) + q_properties->type = KFD_QUEUE_TYPE_SDMA_BY_ENG_ID; else return -ENOTSUPP; @@ -333,6 +336,18 @@ static int kfd_ioctl_create_queue(struct file *filep, struct kfd_process *p, goto err_bind_process; } + if (q_properties.type == KFD_QUEUE_TYPE_SDMA_BY_ENG_ID) { + int max_sdma_eng_id = kfd_get_num_sdma_engines(dev) + + kfd_get_num_xgmi_sdma_engines(dev) - 1; + + if (q_properties.sdma_engine_id > max_sdma_eng_id) { + err = -EINVAL; + pr_err("sdma_engine_id %i exceeds maximum id of %i\n", + q_properties.sdma_engine_id, max_sdma_eng_id); + goto err_sdma_engine_id; + } + } + if (!pdd->qpd.proc_doorbells) { err = kfd_alloc_process_doorbells(dev->kfd, pdd); if (err) { @@ -387,6 +402,7 @@ static int kfd_ioctl_create_queue(struct file *filep, struct kfd_process *p, err_create_queue: kfd_queue_release_buffers(pdd, &q_properties); err_acquire_queue_buf: +err_sdma_engine_id: err_bind_process: err_pdd: mutex_unlock(&p->mutex); diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c index fdc76c24b2e7..f0bfeb35246f 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c @@ -1532,6 +1532,41 @@ static int allocate_sdma_queue(struct device_queue_manager *dqm, q->sdma_id % kfd_get_num_xgmi_sdma_engines(dqm->dev); q->properties.sdma_queue_id = q->sdma_id / kfd_get_num_xgmi_sdma_engines(dqm->dev); + } else if (q->properties.type == KFD_QUEUE_TYPE_SDMA_BY_ENG_ID) { + int i, num_queues, num_engines, eng_offset = 0, start_engine; + bool free_bit_found = false, is_xgmi = false; + + if (q->properties.sdma_engine_id < kfd_get_num_sdma_engines(dqm->dev)) { + num_queues = get_num_sdma_queues(dqm); + num_engines = kfd_get_num_sdma_engines(dqm->dev); + q->properties.type = KFD_QUEUE_TYPE_SDMA; + } else { + num_queues = get_num_xgmi_sdma_queues(dqm); + num_engines = kfd_get_num_xgmi_sdma_engines(dqm->dev); + eng_offset = 
kfd_get_num_sdma_engines(dqm->dev); + q->properties.type = KFD_QUEUE_TYPE_SDMA_XGMI; + is_xgmi = true; + } + + /* Scan available bit based on target engine ID. */ + start_engine = q->properties.sdma_engine_id - eng_offset; + for (i = start_engine; i < num_queues; i += num_engines) { + + if (!test_bit(i, is_xgmi ? dqm->xgmi_sdma_bitmap : dqm->sdma_bitmap)) + continue; + + clear_bit(i, is_xgmi ? dqm->xgmi_sdma_bitmap : dqm->sdma_bitmap); + q->sdma_id = i; + q->properties.sdma_queue_id = q->sdma_id / num_engines; + free_bit_found = true; + break; + } + + if (!free_bit_found) { + dev_err(dev, "No more SDMA queue to allocate for target ID %i\n", + q->properties.sdma_engine_id); + return -ENOMEM; + } } pr_debug("SDMA engine id: %d\n", q->properties.sdma_engine_id); @@ -1784,7 +1819,8 @@ static int create_queue_cpsch(struct device_queue_manager *dqm, struct queue *q, } if (q->properties.type == KFD_QUEUE_TYPE_SDMA || - q->properties.type == KFD_QUEUE_TYPE_SDMA_XGMI) { + q->properties.type == KFD_QUEUE_TYPE_SDMA_XGMI || + q->properties.type == KFD_QUEUE_TYPE_SDMA_BY_ENG_ID) { dqm_lock(dqm); retval = allocate_sdma_queue(dqm, q, qd ? &qd->sdma_id : NULL); dqm_unlock(dqm); diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h index b5cae48dff66..4190fa339913 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h @@ -414,13 +414,16 @@ enum kfd_unmap_queues_filter { * @KFD_QUEUE_TYPE_DIQ: DIQ queue type. * * @KFD_QUEUE_TYPE_SDMA_XGMI: Special SDMA queue for XGMI interface. + * + * @KFD_QUEUE_TYPE_SDMA_BY_ENG_ID: SDMA user mode queue with target SDMA engine ID. */ enum kfd_queue_type { KFD_QUEUE_TYPE_COMPUTE, KFD_QUEUE_TYPE_SDMA, KFD_QUEUE_TYPE_HIQ, KFD_QUEUE_TYPE_DIQ, - KFD_QUEUE_TYPE_SDMA_XGMI + KFD_QUEUE_TYPE_SDMA_XGMI, + KFD_QUEUE_TYPE_SDMA_BY_ENG_ID }; enum kfd_queue_format { diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c index 9995dbb43359..f732ee35b531 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c @@ -366,6 +366,7 @@ int pqm_create_queue(struct process_queue_manager *pqm, switch (type) { case KFD_QUEUE_TYPE_SDMA: case KFD_QUEUE_TYPE_SDMA_XGMI: + case KFD_QUEUE_TYPE_SDMA_BY_ENG_ID: /* SDMA queues are always allocated statically no matter * which scheduler mode is used. 
We also do not need to * check whether a SDMA queue can be allocated here, because diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c index a9b3eda65a2c..40771f8752cb 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c @@ -292,6 +292,8 @@ static ssize_t iolink_show(struct kobject *kobj, struct attribute *attr, iolink->max_bandwidth); sysfs_show_32bit_prop(buffer, offs, "recommended_transfer_size", iolink->rec_transfer_size); + sysfs_show_32bit_prop(buffer, offs, "recommended_sdma_engine_id_mask", + iolink->rec_sdma_eng_id_mask); sysfs_show_32bit_prop(buffer, offs, "flags", iolink->flags); return offs; @@ -1265,6 +1267,55 @@ static void kfd_set_iolink_non_coherent(struct kfd_topology_device *to_dev, } } +#define REC_SDMA_NUM_GPU 8 +static const int rec_sdma_eng_map[REC_SDMA_NUM_GPU][REC_SDMA_NUM_GPU] = { + { -1, 14, 12, 2, 4, 8, 10, 6 }, + { 14, -1, 2, 10, 8, 4, 6, 12 }, + { 10, 2, -1, 12, 14, 6, 4, 8 }, + { 2, 12, 10, -1, 6, 14, 8, 4 }, + { 4, 8, 14, 6, -1, 10, 12, 2 }, + { 8, 4, 6, 14, 12, -1, 2, 10 }, + { 10, 6, 4, 8, 12, 2, -1, 14 }, + { 6, 12, 8, 4, 2, 10, 14, -1 }}; + +static void kfd_set_recommended_sdma_engines(struct kfd_topology_device *to_dev, + struct kfd_iolink_properties *outbound_link, + struct kfd_iolink_properties *inbound_link) +{ + struct kfd_node *gpu = outbound_link->gpu; + struct amdgpu_device *adev = gpu->adev; + int num_xgmi_nodes = adev->gmc.xgmi.num_physical_nodes; + bool support_rec_eng = !amdgpu_sriov_vf(adev) && to_dev->gpu && + adev->aid_mask && num_xgmi_nodes && + (amdgpu_xcp_query_partition_mode(adev->xcp_mgr, AMDGPU_XCP_FL_NONE) == + AMDGPU_SPX_PARTITION_MODE) && + (!(adev->flags & AMD_IS_APU) && num_xgmi_nodes == 8); + + if (support_rec_eng) { + int src_socket_id = adev->gmc.xgmi.physical_node_id; + int dst_socket_id = to_dev->gpu->adev->gmc.xgmi.physical_node_id; + + outbound_link->rec_sdma_eng_id_mask = + 1 << rec_sdma_eng_map[src_socket_id][dst_socket_id]; + inbound_link->rec_sdma_eng_id_mask = + 1 << rec_sdma_eng_map[dst_socket_id][src_socket_id]; + } else { + int num_sdma_eng = kfd_get_num_sdma_engines(gpu); + int i, eng_offset = 0; + + if (outbound_link->iolink_type == CRAT_IOLINK_TYPE_XGMI && + kfd_get_num_xgmi_sdma_engines(gpu) && to_dev->gpu) { + eng_offset = num_sdma_eng; + num_sdma_eng = kfd_get_num_xgmi_sdma_engines(gpu); + } + + for (i = 0; i < num_sdma_eng; i++) { + outbound_link->rec_sdma_eng_id_mask |= (1 << (i + eng_offset)); + inbound_link->rec_sdma_eng_id_mask |= (1 << (i + eng_offset)); + } + } +} + static void kfd_fill_iolink_non_crat_info(struct kfd_topology_device *dev) { struct kfd_iolink_properties *link, *inbound_link; @@ -1303,6 +1354,7 @@ static void kfd_fill_iolink_non_crat_info(struct kfd_topology_device *dev) inbound_link->flags = CRAT_IOLINK_FLAGS_ENABLED; kfd_set_iolink_no_atomics(peer_dev, dev, inbound_link); kfd_set_iolink_non_coherent(peer_dev, link, inbound_link); + kfd_set_recommended_sdma_engines(peer_dev, link, inbound_link); } } diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.h b/drivers/gpu/drm/amd/amdkfd/kfd_topology.h index 43ba0d32e5bd..155b5c410af1 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.h @@ -125,6 +125,7 @@ struct kfd_iolink_properties { uint32_t min_bandwidth; uint32_t max_bandwidth; uint32_t rec_transfer_size; + uint32_t rec_sdma_eng_id_mask; uint32_t flags; struct kfd_node *gpu; struct kobject *kobj; diff --git a/include/uapi/linux/kfd_ioctl.h 
b/include/uapi/linux/kfd_ioctl.h index 285a36601dc9..71a7ce5f2d4c 100644 --- a/include/uapi/linux/kfd_ioctl.h +++ b/include/uapi/linux/kfd_ioctl.h @@ -42,9 +42,10 @@ * - 1.14 - Update kfd_event_data * - 1.15 - Enable managing mappings in compute VMs with GEM_VA ioctl * - 1.16 - Add contiguous VRAM allocation flag + * - 1.17 - Add SDMA queue creation with target SDMA engine ID */ #define KFD_IOCTL_MAJOR_VERSION 1 -#define KFD_IOCTL_MINOR_VERSION 16 +#define KFD_IOCTL_MINOR_VERSION 17 struct kfd_ioctl_get_version_args { __u32 major_version; /* from KFD */ @@ -56,6 +57,7 @@ struct kfd_ioctl_get_version_args { #define KFD_IOC_QUEUE_TYPE_SDMA 0x1 #define KFD_IOC_QUEUE_TYPE_COMPUTE_AQL 0x2 #define KFD_IOC_QUEUE_TYPE_SDMA_XGMI 0x3 +#define KFD_IOC_QUEUE_TYPE_SDMA_BY_ENG_ID 0x4 #define KFD_MAX_QUEUE_PERCENTAGE 100 #define KFD_MAX_QUEUE_PRIORITY 15 @@ -78,6 +80,8 @@ struct kfd_ioctl_create_queue_args { __u64 ctx_save_restore_address; /* to KFD */ __u32 ctx_save_restore_size; /* to KFD */ __u32 ctl_stack_size; /* to KFD */ + __u32 sdma_engine_id; /* to KFD */ + __u32 pad; }; struct kfd_ioctl_destroy_queue_args { -- cgit v1.2.3 From 7543ae2269a855683a39af57048035f44bc8ef9c Mon Sep 17 00:00:00 2001 From: Wouter Verhelst Date: Thu, 25 Jul 2024 18:45:36 +0200 Subject: nbd: add support for rotational devices The NBD protocol defines the flag NBD_FLAG_ROTATIONAL to flag that the export in use should be treated as a rotational device. Add support for that flag to the kernel driver. Signed-off-by: Wouter Verhelst Reviewed-by: Eric Blake Link: https://lore.kernel.org/r/20240725164536.1275851-1-w@uter.be Signed-off-by: Jens Axboe --- drivers/block/nbd.c | 3 +++ include/uapi/linux/nbd.h | 3 ++- 2 files changed, 5 insertions(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c index 41a90150b501..5b1811b1ba5f 100644 --- a/drivers/block/nbd.c +++ b/drivers/block/nbd.c @@ -350,6 +350,9 @@ static int __nbd_set_size(struct nbd_device *nbd, loff_t bytesize, lim.features |= BLK_FEAT_WRITE_CACHE; lim.features &= ~BLK_FEAT_FUA; } + if (nbd->config->flags & NBD_FLAG_ROTATIONAL) + lim.features |= BLK_FEAT_ROTATIONAL; + lim.logical_block_size = blksize; lim.physical_block_size = blksize; error = queue_limits_commit_update(nbd->disk->queue, &lim); diff --git a/include/uapi/linux/nbd.h b/include/uapi/linux/nbd.h index 80ce0ef43afd..d75215f2c675 100644 --- a/include/uapi/linux/nbd.h +++ b/include/uapi/linux/nbd.h @@ -51,8 +51,9 @@ enum { #define NBD_FLAG_READ_ONLY (1 << 1) /* device is read-only */ #define NBD_FLAG_SEND_FLUSH (1 << 2) /* can flush writeback cache */ #define NBD_FLAG_SEND_FUA (1 << 3) /* send FUA (forced unit access) */ -/* there is a gap here to match userspace */ +#define NBD_FLAG_ROTATIONAL (1 << 4) /* device is rotational */ #define NBD_FLAG_SEND_TRIM (1 << 5) /* send trim/discard */ +/* there is a gap here to match userspace */ #define NBD_FLAG_CAN_MULTI_CONN (1 << 8) /* Server supports multiple connections per export. */ /* values for cmd flags in the upper 16 bits of request type */ -- cgit v1.2.3 From f58872f45c36ded048bccc22701b0986019c24d8 Mon Sep 17 00:00:00 2001 From: Marcelo Schmitt Date: Fri, 12 Jul 2024 16:20:36 -0300 Subject: spi: Enable controllers to extend the SPI protocol with MOSI idle configuration The behavior of an SPI controller data output line (SDO or MOSI or COPI (Controller Output Peripheral Input) for disambiguation) is usually not specified when the controller is not clocking out data on SCLK edges. 
However, there do exist SPI peripherals that require specific MOSI line state when data is not being clocked out of the controller. Conventional SPI controllers may set the MOSI line on SCLK edges then bring it low when no data is going out or leave the line the state of the last transfer bit. More elaborated controllers are capable to set the MOSI idle state according to different configurable levels and thus are more suitable for interfacing with demanding peripherals. Add SPI mode bits to allow peripherals to request explicit MOSI idle state when needed. When supporting a particular MOSI idle configuration, the data output line state is expected to remain at the configured level when the controller is not clocking out data. When a device that needs a specific MOSI idle state is identified, its driver should request the MOSI idle configuration by setting the proper SPI mode bit. Acked-by: Nuno Sa Reviewed-by: Jonathan Cameron Reviewed-by: David Lechner Tested-by: David Lechner Signed-off-by: Marcelo Schmitt Link: https://patch.msgid.link/9802160b5e5baed7f83ee43ac819cb757a19be55.1720810545.git.marcelo.schmitt@analog.com Signed-off-by: Mark Brown --- Documentation/spi/spi-summary.rst | 83 +++++++++++++++++++++++++++++++++++++++ drivers/spi/spi.c | 6 +++ include/uapi/linux/spi/spi.h | 5 ++- 3 files changed, 92 insertions(+), 2 deletions(-) (limited to 'include/uapi') diff --git a/Documentation/spi/spi-summary.rst b/Documentation/spi/spi-summary.rst index 7f8accfae6f9..6e21e6f86912 100644 --- a/Documentation/spi/spi-summary.rst +++ b/Documentation/spi/spi-summary.rst @@ -614,6 +614,89 @@ queue, and then start some asynchronous transfer engine (unless it's already running). +Extensions to the SPI protocol +------------------------------ +The fact that SPI doesn't have a formal specification or standard permits chip +manufacturers to implement the SPI protocol in slightly different ways. In most +cases, SPI protocol implementations from different vendors are compatible among +each other. For example, in SPI mode 0 (CPOL=0, CPHA=0) the bus lines may behave +like the following: + +:: + + nCSx ___ ___ + \_________________________________________________________________/ + • • + • • + SCLK ___ ___ ___ ___ ___ ___ ___ ___ + _______/ \___/ \___/ \___/ \___/ \___/ \___/ \___/ \_____ + • : ; : ; : ; : ; : ; : ; : ; : ; • + • : ; : ; : ; : ; : ; : ; : ; : ; • + MOSI XXX__________ _______ _______ ________XXX + 0xA5 XXX__/ 1 \_0_____/ 1 \_0_______0_____/ 1 \_0_____/ 1 \_XXX + • ; ; ; ; ; ; ; ; • + • ; ; ; ; ; ; ; ; • + MISO XXX__________ _______________________ _______ XXX + 0xBA XXX__/ 1 \_____0_/ 1 1 1 \_____0__/ 1 \____0__XXX + +Legend:: + + • marks the start/end of transmission; + : marks when data is clocked into the peripheral; + ; marks when data is clocked into the controller; + X marks when line states are not specified. + +In some few cases, chips extend the SPI protocol by specifying line behaviors +that other SPI protocols don't (e.g. data line state for when CS is not +asserted). Those distinct SPI protocols, modes, and configurations are supported +by different SPI mode flags. + +MOSI idle state configuration +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Common SPI protocol implementations don't specify any state or behavior for the +MOSI line when the controller is not clocking out data. However, there do exist +peripherals that require specific MOSI line state when data is not being clocked +out. 
For example, if the peripheral expects the MOSI line to be high when the +controller is not clocking out data (``SPI_MOSI_IDLE_HIGH``), then a transfer in +SPI mode 0 would look like the following: + +:: + + nCSx ___ ___ + \_________________________________________________________________/ + • • + • • + SCLK ___ ___ ___ ___ ___ ___ ___ ___ + _______/ \___/ \___/ \___/ \___/ \___/ \___/ \___/ \_____ + • : ; : ; : ; : ; : ; : ; : ; : ; • + • : ; : ; : ; : ; : ; : ; : ; : ; • + MOSI _____ _______ _______ _______________ ___ + 0x56 \_0_____/ 1 \_0_____/ 1 \_0_____/ 1 1 \_0_____/ + • ; ; ; ; ; ; ; ; • + • ; ; ; ; ; ; ; ; • + MISO XXX__________ _______________________ _______ XXX + 0xBA XXX__/ 1 \_____0_/ 1 1 1 \_____0__/ 1 \____0__XXX + +Legend:: + + • marks the start/end of transmission; + : marks when data is clocked into the peripheral; + ; marks when data is clocked into the controller; + X marks when line states are not specified. + +In this extension to the usual SPI protocol, the MOSI line state is specified to +be kept high when CS is asserted but the controller is not clocking out data to +the peripheral and also when CS is not asserted. + +Peripherals that require this extension must request it by setting the +``SPI_MOSI_IDLE_HIGH`` bit into the mode attribute of their ``struct +spi_device`` and call spi_setup(). Controllers that support this extension +should indicate it by setting ``SPI_MOSI_IDLE_HIGH`` in the mode_bits attribute +of their ``struct spi_controller``. The configuration to idle MOSI low is +analogous but uses the ``SPI_MOSI_IDLE_LOW`` mode bit. + + THANKS TO --------- Contributors to Linux-SPI discussions include (in alphabetical order, diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c index 6ebe5dd9bbb1..60a2b5a0c85d 100644 --- a/drivers/spi/spi.c +++ b/drivers/spi/spi.c @@ -3921,6 +3921,12 @@ int spi_setup(struct spi_device *spi) (SPI_TX_DUAL | SPI_TX_QUAD | SPI_TX_OCTAL | SPI_RX_DUAL | SPI_RX_QUAD | SPI_RX_OCTAL))) return -EINVAL; + /* Check against conflicting MOSI idle configuration */ + if ((spi->mode & SPI_MOSI_IDLE_LOW) && (spi->mode & SPI_MOSI_IDLE_HIGH)) { + dev_err(&spi->dev, + "setup: MOSI configured to idle low and high at the same time.\n"); + return -EINVAL; + } /* * Help drivers fail *cleanly* when they need options * that aren't supported with their current controller. diff --git a/include/uapi/linux/spi/spi.h b/include/uapi/linux/spi/spi.h index ca56e477d161..ee4ac812b8f8 100644 --- a/include/uapi/linux/spi/spi.h +++ b/include/uapi/linux/spi/spi.h @@ -28,7 +28,8 @@ #define SPI_RX_OCTAL _BITUL(14) /* receive with 8 wires */ #define SPI_3WIRE_HIZ _BITUL(15) /* high impedance turnaround */ #define SPI_RX_CPHA_FLIP _BITUL(16) /* flip CPHA on Rx only xfer */ -#define SPI_MOSI_IDLE_LOW _BITUL(17) /* leave mosi line low when idle */ +#define SPI_MOSI_IDLE_LOW _BITUL(17) /* leave MOSI line low when idle */ +#define SPI_MOSI_IDLE_HIGH _BITUL(18) /* leave MOSI line high when idle */ /* * All the bits defined above should be covered by SPI_MODE_USER_MASK. @@ -38,6 +39,6 @@ * These bits must not overlap. A static assert check should make sure of that. * If adding extra bits, make sure to increase the bit index below as well. 
 */
-#define SPI_MODE_USER_MASK (_BITUL(18) - 1)
+#define SPI_MODE_USER_MASK (_BITUL(19) - 1)

 #endif /* _UAPI_SPI_H */
-- cgit v1.2.3


From ba386777a30b38dabcc7fb8a89ec2869a09915f7 Mon Sep 17 00:00:00 2001
From: Vignesh Balasubramanian
Date: Thu, 25 Jul 2024 21:40:18 +0530
Subject: x86/elf: Add a new FPU buffer layout info to x86 core files

Add a new .note section containing type, size, offset and flags of
every xfeature that is present. This information will be used by
debuggers to understand the XSAVE layout of the machine where the core
file has been dumped, and to read XSAVE registers, especially during
cross-platform debugging.

The XSAVE layouts of modern AMD and Intel CPUs differ, especially since
Memory Protection Keys and the AVX-512 features have been incorporated
into the AMD CPUs.

Since AMD never adopted (and hence never left room in the XSAVE layout
for) the Intel MPX feature, tools like GDB had assumed a fixed XSAVE
layout matching that of Intel (based on the XCR0 mask).

Hence, core dumps from AMD CPUs didn't match the size expected for the
XCR0 mask. This resulted in GDB and other tools not being able to
access the values of the AVX-512 and PKRU registers on AMD CPUs.

To solve this, an interim solution has been accepted into GDB, and is
already part of GDB 14, see
https://sourceware.org/pipermail/gdb-patches/2023-March/198081.html.
But it depends on heuristics based on the total XSAVE register set size
and the XCR0 mask to infer the layouts of the various register blocks
for core dumps, and hence is not a foolproof mechanism to determine the
layout of the XSAVE area.

Therefore, add a new core dump note in order to allow GDB/LLDB and
other relevant tools to determine the layout of the XSAVE area of the
machine where the core file was dumped. The new core dump note (which
is being proposed as a per-process .note section), NT_X86_XSAVE_LAYOUT
(0x205), contains an array of structures. Each structure describes an
individual extended feature, giving its offset, size and flags in this
format:

    struct x86_xfeat_component {
        u32 type;
        u32 size;
        u32 offset;
        u32 flags;
    };

and does so in an independent manner, allowing for future extensions
without depending on hw arch specifics like CPUID etc.

[ bp: Massage commit message, zap trailing whitespace.
] Co-developed-by: Jini Susan George Signed-off-by: Jini Susan George Co-developed-by: Borislav Petkov (AMD) Signed-off-by: Borislav Petkov (AMD) Signed-off-by: Vignesh Balasubramanian Link: https://lore.kernel.org/r/20240725161017.112111-2-vigbalas@amd.com --- arch/x86/Kconfig | 1 + arch/x86/include/uapi/asm/elf.h | 16 ++++++++ arch/x86/kernel/fpu/xstate.c | 89 +++++++++++++++++++++++++++++++++++++++++ fs/binfmt_elf.c | 4 +- include/uapi/linux/elf.h | 1 + 5 files changed, 109 insertions(+), 2 deletions(-) create mode 100644 arch/x86/include/uapi/asm/elf.h (limited to 'include/uapi') diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 007bab9f2a0e..c15b4b3fb328 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -107,6 +107,7 @@ config X86 select ARCH_HAS_DEBUG_WX select ARCH_HAS_ZONE_DMA_SET if EXPERT select ARCH_HAVE_NMI_SAFE_CMPXCHG + select ARCH_HAVE_EXTRA_ELF_NOTES select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI select ARCH_MIGHT_HAVE_PC_PARPORT diff --git a/arch/x86/include/uapi/asm/elf.h b/arch/x86/include/uapi/asm/elf.h new file mode 100644 index 000000000000..468e135fa285 --- /dev/null +++ b/arch/x86/include/uapi/asm/elf.h @@ -0,0 +1,16 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _UAPI_ASM_X86_ELF_H +#define _UAPI_ASM_X86_ELF_H + +#include + +struct x86_xfeat_component { + __u32 type; + __u32 size; + __u32 offset; + __u32 flags; +} __packed; + +_Static_assert(sizeof(struct x86_xfeat_component) % 4 == 0, "x86_xfeat_component is not aligned"); + +#endif /* _UAPI_ASM_X86_ELF_H */ diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c index c5a026fee5e0..f3a2e59a28e7 100644 --- a/arch/x86/kernel/fpu/xstate.c +++ b/arch/x86/kernel/fpu/xstate.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include @@ -23,6 +24,8 @@ #include #include +#include + #include "context.h" #include "internal.h" #include "legacy.h" @@ -1838,3 +1841,89 @@ int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns, return 0; } #endif /* CONFIG_PROC_PID_ARCH_STATUS */ + +#ifdef CONFIG_COREDUMP +static const char owner_name[] = "LINUX"; + +/* + * Dump type, size, offset and flag values for every xfeature that is present. 
+ */ +static int dump_xsave_layout_desc(struct coredump_params *cprm) +{ + int num_records = 0; + int i; + + for_each_extended_xfeature(i, fpu_user_cfg.max_features) { + struct x86_xfeat_component xc = { + .type = i, + .size = xstate_sizes[i], + .offset = xstate_offsets[i], + /* reserved for future use */ + .flags = 0, + }; + + if (!dump_emit(cprm, &xc, sizeof(xc))) + return 0; + + num_records++; + } + return num_records; +} + +static u32 get_xsave_desc_size(void) +{ + u32 cnt = 0; + u32 i; + + for_each_extended_xfeature(i, fpu_user_cfg.max_features) + cnt++; + + return cnt * (sizeof(struct x86_xfeat_component)); +} + +int elf_coredump_extra_notes_write(struct coredump_params *cprm) +{ + int num_records = 0; + struct elf_note en; + + if (!fpu_user_cfg.max_features) + return 0; + + en.n_namesz = sizeof(owner_name); + en.n_descsz = get_xsave_desc_size(); + en.n_type = NT_X86_XSAVE_LAYOUT; + + if (!dump_emit(cprm, &en, sizeof(en))) + return 1; + if (!dump_emit(cprm, owner_name, en.n_namesz)) + return 1; + if (!dump_align(cprm, 4)) + return 1; + + num_records = dump_xsave_layout_desc(cprm); + if (!num_records) + return 1; + + /* Total size should be equal to the number of records */ + if ((sizeof(struct x86_xfeat_component) * num_records) != en.n_descsz) + return 1; + + return 0; +} + +int elf_coredump_extra_notes_size(void) +{ + int size; + + if (!fpu_user_cfg.max_features) + return 0; + + /* .note header */ + size = sizeof(struct elf_note); + /* Name plus alignment to 4 bytes */ + size += roundup(sizeof(owner_name), 4); + size += get_xsave_desc_size(); + + return size; +} +#endif /* CONFIG_COREDUMP */ diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index 19fa49cd9907..01bcbe7fdebd 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -2039,7 +2039,7 @@ static int elf_core_dump(struct coredump_params *cprm) { size_t sz = info.size; - /* For cell spufs */ + /* For cell spufs and x86 xstate */ sz += elf_coredump_extra_notes_size(); phdr4note = kmalloc(sizeof(*phdr4note), GFP_KERNEL); @@ -2103,7 +2103,7 @@ static int elf_core_dump(struct coredump_params *cprm) if (!write_note_info(&info, cprm)) goto end_coredump; - /* For cell spufs */ + /* For cell spufs and x86 xstate */ if (elf_coredump_extra_notes_write(cprm)) goto end_coredump; diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h index b54b313bcf07..e30a9b47dc87 100644 --- a/include/uapi/linux/elf.h +++ b/include/uapi/linux/elf.h @@ -411,6 +411,7 @@ typedef struct elf64_shdr { #define NT_X86_XSTATE 0x202 /* x86 extended state using xsave */ /* Old binutils treats 0x203 as a CET state */ #define NT_X86_SHSTK 0x204 /* x86 SHSTK state */ +#define NT_X86_XSAVE_LAYOUT 0x205 /* XSAVE layout description */ #define NT_S390_HIGH_GPRS 0x300 /* s390 upper register halves */ #define NT_S390_TIMER 0x301 /* s390 timer register */ #define NT_S390_TODCMP 0x302 /* s390 TOD clock comparator register */ -- cgit v1.2.3 From f2881dfdaaa9ec873dbd383ef5512fc31e576cbb Mon Sep 17 00:00:00 2001 From: Geert Uytterhoeven Date: Mon, 29 Jul 2024 11:26:34 +0200 Subject: drm/xe/oa/uapi: Make bit masks unsigned MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When building with gcc-5: In function ‘decode_oa_format.isra.26’, inlined from ‘xe_oa_set_prop_oa_format’ at drivers/gpu/drm/xe/xe_oa.c:1664:6: ././include/linux/compiler_types.h:510:38: error: call to ‘__compiletime_assert_1336’ declared with attribute error: FIELD_GET: mask is not constant [...] 
./include/linux/bitfield.h:155:3: note: in expansion of macro ‘__BF_FIELD_CHECK’
__BF_FIELD_CHECK(_mask, _reg, 0U, "FIELD_GET: "); \
^
drivers/gpu/drm/xe/xe_oa.c:1573:18: note: in expansion of macro ‘FIELD_GET’
u32 bc_report = FIELD_GET(DRM_XE_OA_FORMAT_MASK_BC_REPORT, fmt);
^

Fixes: b6fd51c62119 ("drm/xe/oa/uapi: Define and parse OA stream properties")
Signed-off-by: Geert Uytterhoeven
Reviewed-by: Lucas De Marchi
Link: https://patchwork.freedesktop.org/patch/msgid/20240729092634.2227611-1-geert+renesas@glider.be
Signed-off-by: Lucas De Marchi
---
 include/uapi/drm/xe_drm.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

(limited to 'include/uapi')

diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
index 29425d7fdc77..b6fbe4988f2e 100644
--- a/include/uapi/drm/xe_drm.h
+++ b/include/uapi/drm/xe_drm.h
@@ -1598,10 +1598,10 @@ enum drm_xe_oa_property_id {
 * b. Counter select c. Counter size and d. BC report. Also refer to the
 * oa_formats array in drivers/gpu/drm/xe/xe_oa.c.
 */
-#define DRM_XE_OA_FORMAT_MASK_FMT_TYPE (0xff << 0)
-#define DRM_XE_OA_FORMAT_MASK_COUNTER_SEL (0xff << 8)
-#define DRM_XE_OA_FORMAT_MASK_COUNTER_SIZE (0xff << 16)
-#define DRM_XE_OA_FORMAT_MASK_BC_REPORT (0xff << 24)
+#define DRM_XE_OA_FORMAT_MASK_FMT_TYPE (0xffu << 0)
+#define DRM_XE_OA_FORMAT_MASK_COUNTER_SEL (0xffu << 8)
+#define DRM_XE_OA_FORMAT_MASK_COUNTER_SIZE (0xffu << 16)
+#define DRM_XE_OA_FORMAT_MASK_BC_REPORT (0xffu << 24)

 /**
 * @DRM_XE_OA_PROPERTY_OA_PERIOD_EXPONENT: Requests periodic OA unit
-- cgit v1.2.3


From d579b04a52a183db47dfcb7a44304d7747d551e1 Mon Sep 17 00:00:00 2001
From: Yu-Ting Tseng
Date: Tue, 9 Jul 2024 00:00:47 -0700
Subject: binder: frozen notification

Frozen processes present a significant challenge in binder
transactions. When a process is frozen, it cannot, by design, accept
and/or respond to binder transactions. As a result, the sender needs to
adjust its behavior, such as postponing transactions until the peer
process unfreezes. However, there is currently no way to subscribe to
these state change events, making it impossible to implement
frozen-aware behaviors efficiently.

Introduce a binder API for subscribing to frozen state change events.
This allows programs to react to changes in peer process state,
mitigating issues related to binder transactions sent to frozen
processes.

Implementation details:
For a given binder_ref, the state of frozen notification can be one of
the following:

1. Userspace doesn't want a notification. binder_ref->freeze is null.
2. Userspace wants a notification but none is in flight.
   list_empty(&binder_ref->freeze->work.entry) = true
3. A notification is in flight and waiting to be read by userspace.
   binder_ref_freeze.sent is false.
4. A notification was read by userspace and the kernel is waiting for an
   ack. binder_ref_freeze.sent is true.

When a notification is in flight, new state change events are coalesced
into the existing binder_ref_freeze struct. If userspace hasn't picked
up the notification yet, the driver simply rewrites the state.
Otherwise, the notification is flagged as requiring a resend, which
will be performed once userspace acks the original notification that's
in flight.

See https://r.android.com/3070045 for how userspace is going to use
this feature.
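[ Editorial sketch, not part of the original patch: the flow described
  above maps onto the new commands roughly as follows. Error handling,
  the read-side buffer parsing and the handle/cookie plumbing are
  illustrative only. ]

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/android/binder.h>

	/* Ask for freeze state change notifications on one handle. */
	static int request_freeze_notification(int binder_fd, __u32 handle,
					       binder_uintptr_t cookie)
	{
		struct {
			uint32_t cmd;
			struct binder_handle_cookie hc;
		} __attribute__((packed)) wr = {
			.cmd = BC_REQUEST_FREEZE_NOTIFICATION,
			.hc = { .handle = handle, .cookie = cookie },
		};
		struct binder_write_read bwr = {
			.write_size = sizeof(wr),
			.write_buffer = (binder_uintptr_t)(uintptr_t)&wr,
		};

		return ioctl(binder_fd, BINDER_WRITE_READ, &bwr);
	}

	/*
	 * After the read loop consumes a BR_FROZEN_BINDER followed by a
	 * struct binder_frozen_state_info, the notification must be acked
	 * so that a coalesced resend (if any) can be delivered:
	 */
	static int ack_freeze_notification(int binder_fd,
					   binder_uintptr_t cookie)
	{
		struct {
			uint32_t cmd;
			binder_uintptr_t cookie;
		} __attribute__((packed)) wr = {
			.cmd = BC_FREEZE_NOTIFICATION_DONE,
			.cookie = cookie,
		};
		struct binder_write_read bwr = {
			.write_size = sizeof(wr),
			.write_buffer = (binder_uintptr_t)(uintptr_t)&wr,
		};

		return ioctl(binder_fd, BINDER_WRITE_READ, &bwr);
	}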
Signed-off-by: Yu-Ting Tseng Acked-by: Carlos Llamas Link: https://lore.kernel.org/r/20240709070047.4055369-4-yutingtseng@google.com Signed-off-by: Greg Kroah-Hartman --- drivers/android/binder.c | 284 +++++++++++++++++++++++++++++++++++- drivers/android/binder_internal.h | 21 ++- include/uapi/linux/android/binder.h | 36 +++++ 3 files changed, 337 insertions(+), 4 deletions(-) (limited to 'include/uapi') diff --git a/drivers/android/binder.c b/drivers/android/binder.c index f26286e3713e..a8dad2d247c4 100644 --- a/drivers/android/binder.c +++ b/drivers/android/binder.c @@ -1355,6 +1355,7 @@ static void binder_free_ref(struct binder_ref *ref) if (ref->node) binder_free_node(ref->node); kfree(ref->death); + kfree(ref->freeze); kfree(ref); } @@ -3844,6 +3845,155 @@ err_invalid_target_handle: } } +static int +binder_request_freeze_notification(struct binder_proc *proc, + struct binder_thread *thread, + struct binder_handle_cookie *handle_cookie) +{ + struct binder_ref_freeze *freeze; + struct binder_ref *ref; + bool is_frozen; + + freeze = kzalloc(sizeof(*freeze), GFP_KERNEL); + if (!freeze) + return -ENOMEM; + binder_proc_lock(proc); + ref = binder_get_ref_olocked(proc, handle_cookie->handle, false); + if (!ref) { + binder_user_error("%d:%d BC_REQUEST_FREEZE_NOTIFICATION invalid ref %d\n", + proc->pid, thread->pid, handle_cookie->handle); + binder_proc_unlock(proc); + kfree(freeze); + return -EINVAL; + } + + binder_node_lock(ref->node); + + if (ref->freeze || !ref->node->proc) { + binder_user_error("%d:%d invalid BC_REQUEST_FREEZE_NOTIFICATION %s\n", + proc->pid, thread->pid, + ref->freeze ? "already set" : "dead node"); + binder_node_unlock(ref->node); + binder_proc_unlock(proc); + kfree(freeze); + return -EINVAL; + } + binder_inner_proc_lock(ref->node->proc); + is_frozen = ref->node->proc->is_frozen; + binder_inner_proc_unlock(ref->node->proc); + + binder_stats_created(BINDER_STAT_FREEZE); + INIT_LIST_HEAD(&freeze->work.entry); + freeze->cookie = handle_cookie->cookie; + freeze->work.type = BINDER_WORK_FROZEN_BINDER; + freeze->is_frozen = is_frozen; + + ref->freeze = freeze; + + binder_inner_proc_lock(proc); + binder_enqueue_work_ilocked(&ref->freeze->work, &proc->todo); + binder_wakeup_proc_ilocked(proc); + binder_inner_proc_unlock(proc); + + binder_node_unlock(ref->node); + binder_proc_unlock(proc); + return 0; +} + +static int +binder_clear_freeze_notification(struct binder_proc *proc, + struct binder_thread *thread, + struct binder_handle_cookie *handle_cookie) +{ + struct binder_ref_freeze *freeze; + struct binder_ref *ref; + + binder_proc_lock(proc); + ref = binder_get_ref_olocked(proc, handle_cookie->handle, false); + if (!ref) { + binder_user_error("%d:%d BC_CLEAR_FREEZE_NOTIFICATION invalid ref %d\n", + proc->pid, thread->pid, handle_cookie->handle); + binder_proc_unlock(proc); + return -EINVAL; + } + + binder_node_lock(ref->node); + + if (!ref->freeze) { + binder_user_error("%d:%d BC_CLEAR_FREEZE_NOTIFICATION freeze notification not active\n", + proc->pid, thread->pid); + binder_node_unlock(ref->node); + binder_proc_unlock(proc); + return -EINVAL; + } + freeze = ref->freeze; + binder_inner_proc_lock(proc); + if (freeze->cookie != handle_cookie->cookie) { + binder_user_error("%d:%d BC_CLEAR_FREEZE_NOTIFICATION freeze notification cookie mismatch %016llx != %016llx\n", + proc->pid, thread->pid, (u64)freeze->cookie, + (u64)handle_cookie->cookie); + binder_inner_proc_unlock(proc); + binder_node_unlock(ref->node); + binder_proc_unlock(proc); + return -EINVAL; + } + ref->freeze = NULL; + 
/* + * Take the existing freeze object and overwrite its work type. There are three cases here: + * 1. No pending notification. In this case just add the work to the queue. + * 2. A notification was sent and is pending an ack from userspace. Once an ack arrives, we + * should resend with the new work type. + * 3. A notification is pending to be sent. Since the work is already in the queue, nothing + * needs to be done here. + */ + freeze->work.type = BINDER_WORK_CLEAR_FREEZE_NOTIFICATION; + if (list_empty(&freeze->work.entry)) { + binder_enqueue_work_ilocked(&freeze->work, &proc->todo); + binder_wakeup_proc_ilocked(proc); + } else if (freeze->sent) { + freeze->resend = true; + } + binder_inner_proc_unlock(proc); + binder_node_unlock(ref->node); + binder_proc_unlock(proc); + return 0; +} + +static int +binder_freeze_notification_done(struct binder_proc *proc, + struct binder_thread *thread, + binder_uintptr_t cookie) +{ + struct binder_ref_freeze *freeze = NULL; + struct binder_work *w; + + binder_inner_proc_lock(proc); + list_for_each_entry(w, &proc->delivered_freeze, entry) { + struct binder_ref_freeze *tmp_freeze = + container_of(w, struct binder_ref_freeze, work); + + if (tmp_freeze->cookie == cookie) { + freeze = tmp_freeze; + break; + } + } + if (!freeze) { + binder_user_error("%d:%d BC_FREEZE_NOTIFICATION_DONE %016llx not found\n", + proc->pid, thread->pid, (u64)cookie); + binder_inner_proc_unlock(proc); + return -EINVAL; + } + binder_dequeue_work_ilocked(&freeze->work); + freeze->sent = false; + if (freeze->resend) { + freeze->resend = false; + binder_enqueue_work_ilocked(&freeze->work, &proc->todo); + binder_wakeup_proc_ilocked(proc); + } + binder_inner_proc_unlock(proc); + return 0; +} + /** * binder_free_buf() - free the specified buffer * @proc: binder proc that owns buffer @@ -4327,6 +4477,44 @@ static int binder_thread_write(struct binder_proc *proc, binder_inner_proc_unlock(proc); } break; + case BC_REQUEST_FREEZE_NOTIFICATION: { + struct binder_handle_cookie handle_cookie; + int error; + + if (copy_from_user(&handle_cookie, ptr, sizeof(handle_cookie))) + return -EFAULT; + ptr += sizeof(handle_cookie); + error = binder_request_freeze_notification(proc, thread, + &handle_cookie); + if (error) + return error; + } break; + + case BC_CLEAR_FREEZE_NOTIFICATION: { + struct binder_handle_cookie handle_cookie; + int error; + + if (copy_from_user(&handle_cookie, ptr, sizeof(handle_cookie))) + return -EFAULT; + ptr += sizeof(handle_cookie); + error = binder_clear_freeze_notification(proc, thread, &handle_cookie); + if (error) + return error; + } break; + + case BC_FREEZE_NOTIFICATION_DONE: { + binder_uintptr_t cookie; + int error; + + if (get_user(cookie, (binder_uintptr_t __user *)ptr)) + return -EFAULT; + + ptr += sizeof(cookie); + error = binder_freeze_notification_done(proc, thread, cookie); + if (error) + return error; + } break; + default: pr_err("%d:%d unknown command %u\n", proc->pid, thread->pid, cmd); @@ -4716,6 +4904,46 @@ retry: if (cmd == BR_DEAD_BINDER) goto done; /* DEAD_BINDER notifications can cause transactions */ } break; + + case BINDER_WORK_FROZEN_BINDER: { + struct binder_ref_freeze *freeze; + struct binder_frozen_state_info info; + + memset(&info, 0, sizeof(info)); + freeze = container_of(w, struct binder_ref_freeze, work); + info.is_frozen = freeze->is_frozen; + info.cookie = freeze->cookie; + freeze->sent = true; + binder_enqueue_work_ilocked(w, &proc->delivered_freeze); + binder_inner_proc_unlock(proc); + + if (put_user(BR_FROZEN_BINDER, (uint32_t __user 
*)ptr)) + return -EFAULT; + ptr += sizeof(uint32_t); + if (copy_to_user(ptr, &info, sizeof(info))) + return -EFAULT; + ptr += sizeof(info); + binder_stat_br(proc, thread, BR_FROZEN_BINDER); + goto done; /* BR_FROZEN_BINDER notifications can cause transactions */ + } break; + + case BINDER_WORK_CLEAR_FREEZE_NOTIFICATION: { + struct binder_ref_freeze *freeze = + container_of(w, struct binder_ref_freeze, work); + binder_uintptr_t cookie = freeze->cookie; + + binder_inner_proc_unlock(proc); + kfree(freeze); + binder_stats_deleted(BINDER_STAT_FREEZE); + if (put_user(BR_CLEAR_FREEZE_NOTIFICATION_DONE, (uint32_t __user *)ptr)) + return -EFAULT; + ptr += sizeof(uint32_t); + if (put_user(cookie, (binder_uintptr_t __user *)ptr)) + return -EFAULT; + ptr += sizeof(binder_uintptr_t); + binder_stat_br(proc, thread, BR_CLEAR_FREEZE_NOTIFICATION_DONE); + } break; + default: binder_inner_proc_unlock(proc); pr_err("%d:%d: bad work type %d\n", @@ -5324,6 +5552,48 @@ static bool binder_txns_pending_ilocked(struct binder_proc *proc) return false; } +static void binder_add_freeze_work(struct binder_proc *proc, bool is_frozen) +{ + struct rb_node *n; + struct binder_ref *ref; + + binder_inner_proc_lock(proc); + for (n = rb_first(&proc->nodes); n; n = rb_next(n)) { + struct binder_node *node; + + node = rb_entry(n, struct binder_node, rb_node); + binder_inner_proc_unlock(proc); + binder_node_lock(node); + hlist_for_each_entry(ref, &node->refs, node_entry) { + /* + * Need the node lock to synchronize + * with new notification requests and the + * inner lock to synchronize with queued + * freeze notifications. + */ + binder_inner_proc_lock(ref->proc); + if (!ref->freeze) { + binder_inner_proc_unlock(ref->proc); + continue; + } + ref->freeze->work.type = BINDER_WORK_FROZEN_BINDER; + if (list_empty(&ref->freeze->work.entry)) { + ref->freeze->is_frozen = is_frozen; + binder_enqueue_work_ilocked(&ref->freeze->work, &ref->proc->todo); + binder_wakeup_proc_ilocked(ref->proc); + } else { + if (ref->freeze->sent && ref->freeze->is_frozen != is_frozen) + ref->freeze->resend = true; + ref->freeze->is_frozen = is_frozen; + } + binder_inner_proc_unlock(ref->proc); + } + binder_node_unlock(node); + binder_inner_proc_lock(proc); + } + binder_inner_proc_unlock(proc); +} + static int binder_ioctl_freeze(struct binder_freeze_info *info, struct binder_proc *target_proc) { @@ -5335,6 +5605,7 @@ static int binder_ioctl_freeze(struct binder_freeze_info *info, target_proc->async_recv = false; target_proc->is_frozen = false; binder_inner_proc_unlock(target_proc); + binder_add_freeze_work(target_proc, false); return 0; } @@ -5367,6 +5638,8 @@ static int binder_ioctl_freeze(struct binder_freeze_info *info, binder_inner_proc_lock(target_proc); target_proc->is_frozen = false; binder_inner_proc_unlock(target_proc); + } else { + binder_add_freeze_work(target_proc, true); } return ret; @@ -5742,6 +6015,7 @@ static int binder_open(struct inode *nodp, struct file *filp) binder_stats_created(BINDER_STAT_PROC); proc->pid = current->group_leader->pid; INIT_LIST_HEAD(&proc->delivered_death); + INIT_LIST_HEAD(&proc->delivered_freeze); INIT_LIST_HEAD(&proc->waiting_threads); filp->private_data = proc; @@ -6293,7 +6567,9 @@ static const char * const binder_return_strings[] = { "BR_FAILED_REPLY", "BR_FROZEN_REPLY", "BR_ONEWAY_SPAM_SUSPECT", - "BR_TRANSACTION_PENDING_FROZEN" + "BR_TRANSACTION_PENDING_FROZEN", + "BR_FROZEN_BINDER", + "BR_CLEAR_FREEZE_NOTIFICATION_DONE", }; static const char * const binder_command_strings[] = { @@ -6316,6 +6592,9 @@ static 
const char * const binder_command_strings[] = { "BC_DEAD_BINDER_DONE", "BC_TRANSACTION_SG", "BC_REPLY_SG", + "BC_REQUEST_FREEZE_NOTIFICATION", + "BC_CLEAR_FREEZE_NOTIFICATION", + "BC_FREEZE_NOTIFICATION_DONE", }; static const char * const binder_objstat_strings[] = { @@ -6325,7 +6604,8 @@ static const char * const binder_objstat_strings[] = { "ref", "death", "transaction", - "transaction_complete" + "transaction_complete", + "freeze", }; static void print_binder_stats(struct seq_file *m, const char *prefix, diff --git a/drivers/android/binder_internal.h b/drivers/android/binder_internal.h index 7d4fc53f7a73..f8d6be682f23 100644 --- a/drivers/android/binder_internal.h +++ b/drivers/android/binder_internal.h @@ -130,12 +130,13 @@ enum binder_stat_types { BINDER_STAT_DEATH, BINDER_STAT_TRANSACTION, BINDER_STAT_TRANSACTION_COMPLETE, + BINDER_STAT_FREEZE, BINDER_STAT_COUNT }; struct binder_stats { - atomic_t br[_IOC_NR(BR_TRANSACTION_PENDING_FROZEN) + 1]; - atomic_t bc[_IOC_NR(BC_REPLY_SG) + 1]; + atomic_t br[_IOC_NR(BR_CLEAR_FREEZE_NOTIFICATION_DONE) + 1]; + atomic_t bc[_IOC_NR(BC_FREEZE_NOTIFICATION_DONE) + 1]; atomic_t obj_created[BINDER_STAT_COUNT]; atomic_t obj_deleted[BINDER_STAT_COUNT]; }; @@ -160,6 +161,8 @@ struct binder_work { BINDER_WORK_DEAD_BINDER, BINDER_WORK_DEAD_BINDER_AND_CLEAR, BINDER_WORK_CLEAR_DEATH_NOTIFICATION, + BINDER_WORK_FROZEN_BINDER, + BINDER_WORK_CLEAR_FREEZE_NOTIFICATION, } type; }; @@ -276,6 +279,14 @@ struct binder_ref_death { binder_uintptr_t cookie; }; +struct binder_ref_freeze { + struct binder_work work; + binder_uintptr_t cookie; + bool is_frozen:1; + bool sent:1; + bool resend:1; +}; + /** * struct binder_ref_data - binder_ref counts and id * @debug_id: unique ID for the ref @@ -308,6 +319,8 @@ struct binder_ref_data { * @node indicates the node must be freed * @death: pointer to death notification (ref_death) if requested * (protected by @node->lock) + * @freeze: pointer to freeze notification (ref_freeze) if requested + * (protected by @node->lock) * * Structure to track references from procA to target node (on procB). This * structure is unsafe to access without holding @proc->outer_lock. 
@@ -324,6 +337,7 @@ struct binder_ref { struct binder_proc *proc; struct binder_node *node; struct binder_ref_death *death; + struct binder_ref_freeze *freeze; }; /** @@ -377,6 +391,8 @@ struct binder_ref { * (atomics, no lock needed) * @delivered_death: list of delivered death notification * (protected by @inner_lock) + * @delivered_freeze: list of delivered freeze notification + * (protected by @inner_lock) * @max_threads: cap on number of binder threads * (protected by @inner_lock) * @requested_threads: number of binder threads requested but not @@ -424,6 +440,7 @@ struct binder_proc { struct list_head todo; struct binder_stats stats; struct list_head delivered_death; + struct list_head delivered_freeze; u32 max_threads; int requested_threads; int requested_threads_started; diff --git a/include/uapi/linux/android/binder.h b/include/uapi/linux/android/binder.h index d44a8118b2ed..1fd92021a573 100644 --- a/include/uapi/linux/android/binder.h +++ b/include/uapi/linux/android/binder.h @@ -236,6 +236,12 @@ struct binder_frozen_status_info { __u32 async_recv; }; +struct binder_frozen_state_info { + binder_uintptr_t cookie; + __u32 is_frozen; + __u32 reserved; +}; + /* struct binder_extened_error - extended error information * @id: identifier for the failed operation * @command: command as defined by binder_driver_return_protocol @@ -467,6 +473,17 @@ enum binder_driver_return_protocol { /* * The target of the last async transaction is frozen. No parameters. */ + + BR_FROZEN_BINDER = _IOR('r', 21, struct binder_frozen_state_info), + /* + * The cookie and a boolean (is_frozen) that indicates whether the process + * transitioned into a frozen or an unfrozen state. + */ + + BR_CLEAR_FREEZE_NOTIFICATION_DONE = _IOR('r', 22, binder_uintptr_t), + /* + * void *: cookie + */ }; enum binder_driver_command_protocol { @@ -550,6 +567,25 @@ enum binder_driver_command_protocol { /* * binder_transaction_data_sg: the sent command. */ + + BC_REQUEST_FREEZE_NOTIFICATION = + _IOW('c', 19, struct binder_handle_cookie), + /* + * int: handle + * void *: cookie + */ + + BC_CLEAR_FREEZE_NOTIFICATION = _IOW('c', 20, + struct binder_handle_cookie), + /* + * int: handle + * void *: cookie + */ + + BC_FREEZE_NOTIFICATION_DONE = _IOW('c', 21, binder_uintptr_t), + /* + * void *: cookie + */ }; #endif /* _UAPI_LINUX_BINDER_H */ -- cgit v1.2.3 From b6b242d019ed23195c81cf00eb8290d386efb83f Mon Sep 17 00:00:00 2001 From: Hamza Mahfooz Date: Fri, 2 Aug 2024 10:59:45 -0400 Subject: Revert "drm: Introduce 'power saving policy' drm property" This reverts commit 76299a557f36d624ca32500173ad7856e1ad93c0. It was merged without meeting userspace requirements. 
Signed-off-by: Hamza Mahfooz Reviewed-by: Harry Wentland Link: https://patchwork.freedesktop.org/patch/msgid/20240802145946.48073-1-hamza.mahfooz@amd.com --- drivers/gpu/drm/drm_connector.c | 48 ----------------------------------------- include/drm/drm_connector.h | 2 -- include/drm/drm_mode_config.h | 5 ----- include/uapi/drm/drm_mode.h | 7 ------ 4 files changed, 62 deletions(-) (limited to 'include/uapi') diff --git a/drivers/gpu/drm/drm_connector.c b/drivers/gpu/drm/drm_connector.c index 7c44e3a1d8e0..b4f4d2f908d1 100644 --- a/drivers/gpu/drm/drm_connector.c +++ b/drivers/gpu/drm/drm_connector.c @@ -1043,11 +1043,6 @@ static const struct drm_prop_enum_list drm_scaling_mode_enum_list[] = { { DRM_MODE_SCALE_ASPECT, "Full aspect" }, }; -static const struct drm_prop_enum_list drm_power_saving_policy_enum_list[] = { - { __builtin_ffs(DRM_MODE_REQUIRE_COLOR_ACCURACY) - 1, "Require color accuracy" }, - { __builtin_ffs(DRM_MODE_REQUIRE_LOW_LATENCY) - 1, "Require low latency" }, -}; - static const struct drm_prop_enum_list drm_aspect_ratio_enum_list[] = { { DRM_MODE_PICTURE_ASPECT_NONE, "Automatic" }, { DRM_MODE_PICTURE_ASPECT_4_3, "4:3" }, @@ -1634,16 +1629,6 @@ EXPORT_SYMBOL(drm_hdmi_connector_get_output_format_name); * * Drivers can set up these properties by calling * drm_mode_create_tv_margin_properties(). - * power saving policy: - * This property is used to set the power saving policy for the connector. - * This property is populated with a bitmask of optional requirements set - * by the drm master for the drm driver to respect: - * - "Require color accuracy": Disable power saving features that will - * affect color fidelity. - * For example: Hardware assisted backlight modulation. - * - "Require low latency": Disable power saving features that will - * affect latency. - * For example: Panel self refresh (PSR) */ int drm_connector_create_standard_properties(struct drm_device *dev) @@ -2146,39 +2131,6 @@ int drm_mode_create_scaling_mode_property(struct drm_device *dev) } EXPORT_SYMBOL(drm_mode_create_scaling_mode_property); -/** - * drm_mode_create_power_saving_policy_property - create power saving policy property - * @dev: DRM device - * @supported_policies: bitmask of supported power saving policies - * - * Called by a driver the first time it's needed, must be attached to desired - * connectors. 
- * - * Returns: %0 - */ -int drm_mode_create_power_saving_policy_property(struct drm_device *dev, - uint64_t supported_policies) -{ - struct drm_property *power_saving; - - if (dev->mode_config.power_saving_policy) - return 0; - WARN_ON((supported_policies & DRM_MODE_POWER_SAVING_POLICY_ALL) == 0); - - power_saving = - drm_property_create_bitmask(dev, 0, "power saving policy", - drm_power_saving_policy_enum_list, - ARRAY_SIZE(drm_power_saving_policy_enum_list), - supported_policies); - if (!power_saving) - return -ENOMEM; - - dev->mode_config.power_saving_policy = power_saving; - - return 0; -} -EXPORT_SYMBOL(drm_mode_create_power_saving_policy_property); - /** * DOC: Variable refresh properties * diff --git a/include/drm/drm_connector.h b/include/drm/drm_connector.h index 5ad735253413..e3fa43291f44 100644 --- a/include/drm/drm_connector.h +++ b/include/drm/drm_connector.h @@ -2267,8 +2267,6 @@ int drm_mode_create_dp_colorspace_property(struct drm_connector *connector, u32 supported_colorspaces); int drm_mode_create_content_type_property(struct drm_device *dev); int drm_mode_create_suggested_offset_properties(struct drm_device *dev); -int drm_mode_create_power_saving_policy_property(struct drm_device *dev, - uint64_t supported_policies); int drm_connector_set_path_property(struct drm_connector *connector, const char *path); diff --git a/include/drm/drm_mode_config.h b/include/drm/drm_mode_config.h index 150f9a3b649f..ab0f167474b1 100644 --- a/include/drm/drm_mode_config.h +++ b/include/drm/drm_mode_config.h @@ -969,11 +969,6 @@ struct drm_mode_config { */ struct drm_atomic_state *suspend_state; - /** - * @power_saving_policy: bitmask for power saving policy requests. - */ - struct drm_property *power_saving_policy; - const struct drm_mode_config_helper_funcs *helper_private; }; diff --git a/include/uapi/drm/drm_mode.h b/include/uapi/drm/drm_mode.h index 880303c2ad97..d390011b89b4 100644 --- a/include/uapi/drm/drm_mode.h +++ b/include/uapi/drm/drm_mode.h @@ -152,13 +152,6 @@ extern "C" { #define DRM_MODE_SCALE_CENTER 2 /* Centered, no scaling */ #define DRM_MODE_SCALE_ASPECT 3 /* Full screen, preserve aspect */ -/* power saving policy options */ -#define DRM_MODE_REQUIRE_COLOR_ACCURACY BIT(0) /* Compositor requires color accuracy */ -#define DRM_MODE_REQUIRE_LOW_LATENCY BIT(1) /* Compositor requires low latency */ - -#define DRM_MODE_POWER_SAVING_POLICY_ALL (DRM_MODE_REQUIRE_COLOR_ACCURACY |\ - DRM_MODE_REQUIRE_LOW_LATENCY) - /* Dithering mode options */ #define DRM_MODE_DITHERING_OFF 0 #define DRM_MODE_DITHERING_ON 1 -- cgit v1.2.3 From 613f21505b25a4f43f33de00f11afc059bedde2b Mon Sep 17 00:00:00 2001 From: Hans Verkuil Date: Thu, 4 Jul 2024 11:01:51 +0200 Subject: media: cec: core: add new CEC_MSG_FL_REPLY_VENDOR_ID flag If this flag is set, then the reply is expected to consist of the CEC_MSG_VENDOR_COMMAND_WITH_ID opcode followed by the Vendor ID (as used in bytes 1-4 of the message), followed by the struct cec_msg reply field. Note that this assumes that the byte after the Vendor ID is a vendor-specific opcode. This flag makes it easier to wait for replies to vendor commands, using the same CEC framework support for waiting for regular replies. Support for this flag is indicated by setting the new CEC_CAP_REPLY_VENDOR_ID capability. 
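[ Editorial sketch, not part of the original patch: a caller might use
  the new flag roughly as below. The 24-bit vendor ID (0x123456), the
  vendor request/reply opcodes (0x10/0x11) and the logical addresses
  are made up. ]

	#include <sys/ioctl.h>
	#include <linux/cec.h>

	static int vendor_cmd_with_reply(int cec_fd, __u8 from, __u8 to)
	{
		struct cec_msg msg;

		cec_msg_init(&msg, from, to);
		msg.len = 6;
		msg.msg[1] = CEC_MSG_VENDOR_COMMAND_WITH_ID;
		msg.msg[2] = 0x12;	/* hypothetical vendor ID, MSB first */
		msg.msg[3] = 0x34;
		msg.msg[4] = 0x56;
		msg.msg[5] = 0x10;	/* hypothetical vendor request opcode */
		msg.flags = CEC_MSG_FL_REPLY_VENDOR_ID;
		msg.reply = 0x11;	/* expected vendor reply opcode */
		msg.timeout = 1000;	/* ms */

		if (ioctl(cec_fd, CEC_TRANSMIT, &msg))
			return -1;
		/*
		 * On success msg holds the reply: msg.msg[1] is
		 * CEC_MSG_VENDOR_COMMAND_WITH_ID, msg.msg[2..4] repeat the
		 * vendor ID and msg.msg[5] is the 0x11 reply opcode.
		 */
		return (msg.rx_status & CEC_RX_STATUS_OK) ? 0 : -1;
	}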
Signed-off-by: Hans Verkuil Signed-off-by: Mauro Carvalho Chehab --- .../media/cec/cec-ioc-adap-g-caps.rst | 6 +++ .../userspace-api/media/cec/cec-ioc-receive.rst | 15 +++++++ drivers/media/cec/core/cec-adap.c | 52 +++++++++++++++------- drivers/media/cec/core/cec-core.c | 2 +- include/media/cec.h | 2 + include/uapi/linux/cec.h | 3 ++ 6 files changed, 64 insertions(+), 16 deletions(-) (limited to 'include/uapi') diff --git a/Documentation/userspace-api/media/cec/cec-ioc-adap-g-caps.rst b/Documentation/userspace-api/media/cec/cec-ioc-adap-g-caps.rst index d5e014ce19b5..1d5248979a6d 100644 --- a/Documentation/userspace-api/media/cec/cec-ioc-adap-g-caps.rst +++ b/Documentation/userspace-api/media/cec/cec-ioc-adap-g-caps.rst @@ -137,6 +137,12 @@ returns the information to the application. The ioctl never fails. - 0x00000100 - If this capability is set, then :ref:`CEC_ADAP_G_CONNECTOR_INFO` can be used. + * .. _`CEC-CAP-REPLY-VENDOR-ID`: + + - ``CEC_CAP_REPLY_VENDOR_ID`` + - 0x00000200 + - If this capability is set, then + :ref:`CEC_MSG_FL_REPLY_VENDOR_ID ` can be used. Return Value ============ diff --git a/Documentation/userspace-api/media/cec/cec-ioc-receive.rst b/Documentation/userspace-api/media/cec/cec-ioc-receive.rst index 364938ad34df..3e6c511e054f 100644 --- a/Documentation/userspace-api/media/cec/cec-ioc-receive.rst +++ b/Documentation/userspace-api/media/cec/cec-ioc-receive.rst @@ -232,6 +232,21 @@ View On' messages from initiator 0xf ('Unregistered') to destination 0 ('TV'). capability. If that is not set, then the ``EPERM`` error code is returned. + * .. _`CEC-MSG-FL-REPLY-VENDOR-ID`: + + - ``CEC_MSG_FL_REPLY_VENDOR_ID`` + - 4 + - This flag is only available if the ``CEC_CAP_REPLY_VENDOR_ID`` capability + is set. If this flag is set, then the reply is expected to consist of + the ``CEC_MSG_VENDOR_COMMAND_WITH_ID`` opcode followed by the Vendor ID + (in bytes 1-4 of the message), followed by the ``struct cec_msg`` + ``reply`` field. + + Note that this assumes that the byte after the Vendor ID is a + vendor-specific opcode. + + This flag makes it easier to wait for replies to vendor commands. + .. tabularcolumns:: |p{5.6cm}|p{0.9cm}|p{10.8cm}| .. 
_cec-tx-status: diff --git a/drivers/media/cec/core/cec-adap.c b/drivers/media/cec/core/cec-adap.c index da09834990b8..c81b1ed7c08a 100644 --- a/drivers/media/cec/core/cec-adap.c +++ b/drivers/media/cec/core/cec-adap.c @@ -673,8 +673,9 @@ void cec_transmit_done_ts(struct cec_adapter *adap, u8 status, /* Retry this message */ data->attempts -= attempts_made; if (msg->timeout) - dprintk(2, "retransmit: %*ph (attempts: %d, wait for 0x%02x)\n", - msg->len, msg->msg, data->attempts, msg->reply); + dprintk(2, "retransmit: %*ph (attempts: %d, wait for %*ph)\n", + msg->len, msg->msg, data->attempts, + data->match_len, data->match_reply); else dprintk(2, "retransmit: %*ph (attempts: %d)\n", msg->len, msg->msg, data->attempts); @@ -780,6 +781,7 @@ int cec_transmit_msg_fh(struct cec_adapter *adap, struct cec_msg *msg, { struct cec_data *data; bool is_raw = msg_is_raw(msg); + bool reply_vendor_id = msg->flags & CEC_MSG_FL_REPLY_VENDOR_ID; int err; if (adap->devnode.unregistered) @@ -794,12 +796,13 @@ int cec_transmit_msg_fh(struct cec_adapter *adap, struct cec_msg *msg, msg->tx_low_drive_cnt = 0; msg->tx_error_cnt = 0; msg->sequence = 0; + msg->flags &= CEC_MSG_FL_REPLY_TO_FOLLOWERS | CEC_MSG_FL_RAW | + CEC_MSG_FL_REPLY_VENDOR_ID; - if (msg->reply && msg->timeout == 0) { + if ((reply_vendor_id || msg->reply) && msg->timeout == 0) { /* Make sure the timeout isn't 0. */ msg->timeout = 1000; } - msg->flags &= CEC_MSG_FL_REPLY_TO_FOLLOWERS | CEC_MSG_FL_RAW; if (!msg->timeout) msg->flags &= ~CEC_MSG_FL_REPLY_TO_FOLLOWERS; @@ -809,6 +812,11 @@ int cec_transmit_msg_fh(struct cec_adapter *adap, struct cec_msg *msg, dprintk(1, "%s: invalid length %d\n", __func__, msg->len); return -EINVAL; } + if (reply_vendor_id && + (msg->len < 6 || msg->msg[1] != CEC_MSG_VENDOR_COMMAND_WITH_ID)) { + dprintk(1, "%s: message too short or not \n", __func__); + return -EINVAL; + } memset(msg->msg + msg->len, 0, sizeof(msg->msg) - msg->len); @@ -900,8 +908,9 @@ int cec_transmit_msg_fh(struct cec_adapter *adap, struct cec_msg *msg, __func__); return -ENONET; } - if (msg->reply) { - dprintk(1, "%s: invalid msg->reply\n", __func__); + if (reply_vendor_id || msg->reply) { + dprintk(1, "%s: adapter is unconfigured so reply is not supported\n", + __func__); return -EINVAL; } } @@ -923,6 +932,14 @@ int cec_transmit_msg_fh(struct cec_adapter *adap, struct cec_msg *msg, data->fh = fh; data->adap = adap; data->blocking = block; + if (reply_vendor_id) { + memcpy(data->match_reply, msg->msg + 1, 4); + data->match_reply[4] = msg->reply; + data->match_len = 5; + } else if (msg->timeout) { + data->match_reply[0] = msg->reply; + data->match_len = 1; + } init_completion(&data->c); INIT_DELAYED_WORK(&data->work, cec_wait_timeout); @@ -1211,13 +1228,15 @@ void cec_received_msg_ts(struct cec_adapter *adap, if (!abort && dst->msg[1] == CEC_MSG_INITIATE_ARC && (cmd == CEC_MSG_REPORT_ARC_INITIATED || cmd == CEC_MSG_REPORT_ARC_TERMINATED) && - (dst->reply == CEC_MSG_REPORT_ARC_INITIATED || - dst->reply == CEC_MSG_REPORT_ARC_TERMINATED)) + (data->match_reply[0] == CEC_MSG_REPORT_ARC_INITIATED || + data->match_reply[0] == CEC_MSG_REPORT_ARC_TERMINATED)) { dst->reply = cmd; + data->match_reply[0] = cmd; + } /* Does the command match? */ if ((abort && cmd != dst->msg[1]) || - (!abort && cmd != dst->reply)) + (!abort && memcmp(data->match_reply, msg->msg + 1, data->match_len))) continue; /* Does the addressing match? 
*/ @@ -2318,18 +2337,21 @@ int cec_adap_status(struct seq_file *file, void *priv) } data = adap->transmitting; if (data) - seq_printf(file, "transmitting message: %*ph (reply: %02x, timeout: %ums)\n", - data->msg.len, data->msg.msg, data->msg.reply, + seq_printf(file, "transmitting message: %*ph (reply: %*ph, timeout: %ums)\n", + data->msg.len, data->msg.msg, + data->match_len, data->match_reply, data->msg.timeout); seq_printf(file, "pending transmits: %u\n", adap->transmit_queue_sz); list_for_each_entry(data, &adap->transmit_queue, list) { - seq_printf(file, "queued tx message: %*ph (reply: %02x, timeout: %ums)\n", - data->msg.len, data->msg.msg, data->msg.reply, + seq_printf(file, "queued tx message: %*ph (reply: %*ph, timeout: %ums)\n", + data->msg.len, data->msg.msg, + data->match_len, data->match_reply, data->msg.timeout); } list_for_each_entry(data, &adap->wait_queue, list) { - seq_printf(file, "message waiting for reply: %*ph (reply: %02x, timeout: %ums)\n", - data->msg.len, data->msg.msg, data->msg.reply, + seq_printf(file, "message waiting for reply: %*ph (reply: %*ph, timeout: %ums)\n", + data->msg.len, data->msg.msg, + data->match_len, data->match_reply, data->msg.timeout); } diff --git a/drivers/media/cec/core/cec-core.c b/drivers/media/cec/core/cec-core.c index 6f940df0230c..e0756826d629 100644 --- a/drivers/media/cec/core/cec-core.c +++ b/drivers/media/cec/core/cec-core.c @@ -273,7 +273,7 @@ struct cec_adapter *cec_allocate_adapter(const struct cec_adap_ops *ops, adap->cec_pin_is_high = true; adap->log_addrs.cec_version = CEC_OP_CEC_VERSION_2_0; adap->log_addrs.vendor_id = CEC_VENDOR_ID_NONE; - adap->capabilities = caps; + adap->capabilities = caps | CEC_CAP_REPLY_VENDOR_ID; if (debug_phys_addr) adap->capabilities |= CEC_CAP_PHYS_ADDR; adap->needs_hpd = caps & CEC_CAP_NEEDS_HPD; diff --git a/include/media/cec.h b/include/media/cec.h index d131514032f2..07d2ee8a3904 100644 --- a/include/media/cec.h +++ b/include/media/cec.h @@ -66,6 +66,8 @@ struct cec_data { struct list_head xfer_list; struct cec_adapter *adap; struct cec_msg msg; + u8 match_len; + u8 match_reply[5]; struct cec_fh *fh; struct delayed_work work; struct completion c; diff --git a/include/uapi/linux/cec.h b/include/uapi/linux/cec.h index b8e071abaea5..894fffc66f2c 100644 --- a/include/uapi/linux/cec.h +++ b/include/uapi/linux/cec.h @@ -165,6 +165,7 @@ static inline int cec_msg_recv_is_rx_result(const struct cec_msg *msg) /* cec_msg flags field */ #define CEC_MSG_FL_REPLY_TO_FOLLOWERS (1 << 0) #define CEC_MSG_FL_RAW (1 << 1) +#define CEC_MSG_FL_REPLY_VENDOR_ID (1 << 2) /* cec_msg tx/rx_status field */ #define CEC_TX_STATUS_OK (1 << 0) @@ -339,6 +340,8 @@ static inline int cec_is_unconfigured(__u16 log_addr_mask) #define CEC_CAP_MONITOR_PIN (1 << 7) /* CEC_ADAP_G_CONNECTOR_INFO is available */ #define CEC_CAP_CONNECTOR_INFO (1 << 8) +/* CEC_MSG_FL_REPLY_VENDOR_ID is available */ +#define CEC_CAP_REPLY_VENDOR_ID (1 << 9) /** * struct cec_caps - CEC capabilities structure. -- cgit v1.2.3 From 0079c9d1e58a39148e6ce13bda55307ea6aa3a9e Mon Sep 17 00:00:00 2001 From: Takashi Iwai Date: Tue, 6 Aug 2024 09:00:23 +0200 Subject: ALSA: ump: Handle MIDI 1.0 Function Block in MIDI 2.0 protocol The UMP v1.1 spec says in the section 6.2.1: "If a UMP Endpoint declares MIDI 2.0 Protocol but a Function Block represents a MIDI 1.0 connection, then may optionally be used for messages to/from that Function Block." 
It implies that the driver can (and should) keep MIDI 1.0 CVM exceptionally for those FBs even if the UMP Endpoint is running in MIDI 2.0 protocol, and the current driver lacks this. This patch extends the sequencer port info to indicate a MIDI 1.0 port, and tries to send/receive MIDI 1.0 CVM as-is when this port is the source or sink. The sequencer port flag is set by the driver when parsing FBs and GTBs, although an application can also set it for its own user-space clients. Link: https://patch.msgid.link/20240806070024.14301-1-tiwai@suse.de Signed-off-by: Takashi Iwai --- include/sound/ump.h | 1 + include/uapi/sound/asequencer.h | 2 ++ sound/core/seq/seq_ports.c | 2 ++ sound/core/seq/seq_ports.h | 2 ++ sound/core/seq/seq_ump_client.c | 4 +++- sound/core/seq/seq_ump_convert.c | 11 ++++++----- sound/core/ump.c | 10 +++++++++- 7 files changed, 25 insertions(+), 7 deletions(-) (limited to 'include/uapi') diff --git a/include/sound/ump.h b/include/sound/ump.h index 7f68056acdff..7484f62fb234 100644 --- a/include/sound/ump.h +++ b/include/sound/ump.h @@ -18,6 +18,7 @@ struct snd_ump_group { unsigned int dir_bits; /* directions */ bool active; /* activeness */ bool valid; /* valid group (referred by blocks) */ + bool is_midi1; /* belongs to a MIDI1 FB */ char name[64]; /* group name */ }; diff --git a/include/uapi/sound/asequencer.h b/include/uapi/sound/asequencer.h index 39b37edcf813..bc30c1f2a109 100644 --- a/include/uapi/sound/asequencer.h +++ b/include/uapi/sound/asequencer.h @@ -461,6 +461,8 @@ struct snd_seq_remove_events { #define SNDRV_SEQ_PORT_FLG_TIMESTAMP (1<<1) #define SNDRV_SEQ_PORT_FLG_TIME_REAL (1<<2) +#define SNDRV_SEQ_PORT_FLG_IS_MIDI1 (1<<3) /* Keep MIDI 1.0 protocol */ + /* port direction */ #define SNDRV_SEQ_PORT_DIR_UNKNOWN 0 #define SNDRV_SEQ_PORT_DIR_INPUT 1 diff --git a/sound/core/seq/seq_ports.c b/sound/core/seq/seq_ports.c index ca631ca4f2c6..535290f24eed 100644 --- a/sound/core/seq/seq_ports.c +++ b/sound/core/seq/seq_ports.c @@ -362,6 +362,8 @@ int snd_seq_set_port_info(struct snd_seq_client_port * port, port->direction |= SNDRV_SEQ_PORT_DIR_OUTPUT; } + port->is_midi1 = !!(info->flags & SNDRV_SEQ_PORT_FLG_IS_MIDI1); + return 0; } diff --git a/sound/core/seq/seq_ports.h b/sound/core/seq/seq_ports.h index b111382f697a..20a26bffecb7 100644 --- a/sound/core/seq/seq_ports.h +++ b/sound/core/seq/seq_ports.h @@ -87,6 +87,8 @@ struct snd_seq_client_port { unsigned char direction; unsigned char ump_group; + bool is_midi1; /* keep MIDI 1.0 protocol */ + #if IS_ENABLED(CONFIG_SND_SEQ_UMP) struct snd_seq_ump_midi2_bank midi2_bank[16]; /* per channel */ #endif diff --git a/sound/core/seq/seq_ump_client.c b/sound/core/seq/seq_ump_client.c index 637154bd1a3f..e5d3f4d206bf 100644 --- a/sound/core/seq/seq_ump_client.c +++ b/sound/core/seq/seq_ump_client.c @@ -189,6 +189,8 @@ static void fill_port_info(struct snd_seq_port_info *port, port->ump_group = group->group + 1; if (!group->active) port->capability |= SNDRV_SEQ_PORT_CAP_INACTIVE; + if (group->is_midi1) + port->flags |= SNDRV_SEQ_PORT_FLG_IS_MIDI1; port->type = SNDRV_SEQ_PORT_TYPE_MIDI_GENERIC | SNDRV_SEQ_PORT_TYPE_MIDI_UMP | SNDRV_SEQ_PORT_TYPE_HARDWARE | @@ -223,7 +225,7 @@ static int seq_ump_group_init(struct seq_ump_client *client, int group_index) return -ENOMEM; fill_port_info(port, client, group); - port->flags = SNDRV_SEQ_PORT_FLG_GIVEN_PORT; + port->flags |= SNDRV_SEQ_PORT_FLG_GIVEN_PORT; memset(&pcallbacks, 0, sizeof(pcallbacks)); pcallbacks.owner = THIS_MODULE; pcallbacks.private_data = client; diff --git 
a/sound/core/seq/seq_ump_convert.c b/sound/core/seq/seq_ump_convert.c index d9dacfbe4a9a..90656980ae26 100644 --- a/sound/core/seq/seq_ump_convert.c +++ b/sound/core/seq/seq_ump_convert.c @@ -595,12 +595,13 @@ int snd_seq_deliver_from_ump(struct snd_seq_client *source, type = ump_message_type(ump_ev->ump[0]); if (snd_seq_client_is_ump(dest)) { - if (snd_seq_client_is_midi2(dest) && - type == UMP_MSG_TYPE_MIDI1_CHANNEL_VOICE) + bool is_midi2 = snd_seq_client_is_midi2(dest) && + !dest_port->is_midi1; + + if (is_midi2 && type == UMP_MSG_TYPE_MIDI1_CHANNEL_VOICE) return cvt_ump_midi1_to_midi2(dest, dest_port, event, atomic, hop); - else if (!snd_seq_client_is_midi2(dest) && - type == UMP_MSG_TYPE_MIDI2_CHANNEL_VOICE) + else if (!is_midi2 && type == UMP_MSG_TYPE_MIDI2_CHANNEL_VOICE) return cvt_ump_midi2_to_midi1(dest, dest_port, event, atomic, hop); /* non-EP port and different group is set? */ @@ -1254,7 +1255,7 @@ int snd_seq_deliver_to_ump(struct snd_seq_client *source, return 0; /* group filtered - skip the event */ if (event->type == SNDRV_SEQ_EVENT_SYSEX) return cvt_sysex_to_ump(dest, dest_port, event, atomic, hop); - else if (snd_seq_client_is_midi2(dest)) + else if (snd_seq_client_is_midi2(dest) && !dest_port->is_midi1) return cvt_to_ump_midi2(dest, dest_port, event, atomic, hop); else return cvt_to_ump_midi1(dest, dest_port, event, atomic, hop); diff --git a/sound/core/ump.c b/sound/core/ump.c index e39e9cda4912..c7c3581bbbbc 100644 --- a/sound/core/ump.c +++ b/sound/core/ump.c @@ -538,6 +538,7 @@ static void update_group_attrs(struct snd_ump_endpoint *ump) group->active = 0; group->group = i; group->valid = false; + group->is_midi1 = false; } list_for_each_entry(fb, &ump->block_list, list) { @@ -548,6 +549,8 @@ static void update_group_attrs(struct snd_ump_endpoint *ump) group->valid = true; if (fb->info.active) group->active = 1; + if (fb->info.flags & SNDRV_UMP_BLOCK_IS_MIDI1) + group->is_midi1 = true; switch (fb->info.direction) { case SNDRV_UMP_DIR_INPUT: group->dir_bits |= (1 << SNDRV_RAWMIDI_STREAM_INPUT); @@ -1156,6 +1159,7 @@ static int process_legacy_output(struct snd_ump_endpoint *ump, struct snd_rawmidi_substream *substream; struct ump_cvt_to_ump *ctx; const int dir = SNDRV_RAWMIDI_STREAM_OUTPUT; + unsigned int protocol; unsigned char c; int group, size = 0; @@ -1168,9 +1172,13 @@ static int process_legacy_output(struct snd_ump_endpoint *ump, if (!substream) continue; ctx = &ump->out_cvts[group]; + protocol = ump->info.protocol; + if ((protocol & SNDRV_UMP_EP_INFO_PROTO_MIDI2) && + ump->groups[group].is_midi1) + protocol = SNDRV_UMP_EP_INFO_PROTO_MIDI1; while (!ctx->ump_bytes && snd_rawmidi_transmit(substream, &c, 1) > 0) - snd_ump_convert_to_ump(ctx, group, ump->info.protocol, c); + snd_ump_convert_to_ump(ctx, group, protocol, c); if (ctx->ump_bytes && ctx->ump_bytes <= count) { size = ctx->ump_bytes; memcpy(buffer, ctx->ump, size); -- cgit v1.2.3 From 599f6899051cb70c4e0aa9fd591b9ee220cb6f14 Mon Sep 17 00:00:00 2001 From: Hans Verkuil Date: Wed, 7 Aug 2024 09:22:10 +0200 Subject: media: uapi/linux/cec.h: cec_msg_set_reply_to: zero flags The cec_msg_set_reply_to() helper function never zeroed the struct cec_msg flags field, which can cause unexpected behavior if flags was uninitialized to begin with. 
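As a minimal illustration (a hypothetical follower loop with error handling omitted; the helpers come from the CEC uapi headers), a reply built from a reused message buffer could otherwise inherit stale flags such as CEC_MSG_FL_REPLY_VENDOR_ID from an earlier transmit:

  struct cec_msg msg;	/* reused across iterations; flags may hold stale bits */

  while (ioctl(fd, CEC_RECEIVE, &msg) == 0) {
  	if (cec_msg_opcode(&msg) != CEC_MSG_GIVE_DEVICE_POWER_STATUS)
  		continue;
  	/* swaps initiator/destination; now also zeroes reply, timeout and flags */
  	cec_msg_set_reply_to(&msg, &msg);
  	cec_msg_report_power_status(&msg, CEC_OP_POWER_STATUS_ON);
  	ioctl(fd, CEC_TRANSMIT, &msg);
  }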
Signed-off-by: Hans Verkuil Fixes: 0dbacebede1e ("[media] cec: move the CEC framework out of staging and to media") Cc: Signed-off-by: Mauro Carvalho Chehab --- include/uapi/linux/cec.h | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/cec.h b/include/uapi/linux/cec.h index 894fffc66f2c..b2af1dddd4d7 100644 --- a/include/uapi/linux/cec.h +++ b/include/uapi/linux/cec.h @@ -132,6 +132,8 @@ static inline void cec_msg_init(struct cec_msg *msg, * Set the msg destination to the orig initiator and the msg initiator to the * orig destination. Note that msg and orig may be the same pointer, in which * case the change is done in place. + * + * It also zeroes the reply, timeout and flags fields. */ static inline void cec_msg_set_reply_to(struct cec_msg *msg, struct cec_msg *orig) @@ -139,7 +141,9 @@ static inline void cec_msg_set_reply_to(struct cec_msg *msg, /* The destination becomes the initiator and vice versa */ msg->msg[0] = (cec_msg_destination(orig) << 4) | cec_msg_initiator(orig); - msg->reply = msg->timeout = 0; + msg->reply = 0; + msg->timeout = 0; + msg->flags = 0; } /** -- cgit v1.2.3 From 3882dccf48f9fbe787b5df3187d708ef348ac860 Mon Sep 17 00:00:00 2001 From: Alan Maguire Date: Thu, 8 Aug 2024 16:05:57 +0100 Subject: bpf/bpf_get,set_sockopt: add option to set TCP-BPF sock ops flags Currently the only opportunity to set sock ops flags dictating which callbacks fire for a socket is from within a TCP-BPF sockops program. This is problematic if the connection is already set up as there is no further chance to specify callbacks for that socket. Add TCP_BPF_SOCK_OPS_CB_FLAGS to bpf_setsockopt() and bpf_getsockopt() to allow users to specify callbacks later, either via an iterator over sockets or via a socket-specific program triggered by a setsockopt() on the socket. Previous discussion on this here [1]. [1] https://lore.kernel.org/bpf/f42f157b-6e52-dd4d-3d97-9b86c84c0b00@oracle.com/ Signed-off-by: Alan Maguire Link: https://lore.kernel.org/r/20240808150558.1035626-2-alan.maguire@oracle.com Signed-off-by: Martin KaFai Lau --- include/uapi/linux/bpf.h | 3 ++- net/core/filter.c | 16 ++++++++++++++++ tools/include/uapi/linux/bpf.h | 3 ++- 3 files changed, 20 insertions(+), 2 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 35bcf52dbc65..e05b39e39c3f 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2851,7 +2851,7 @@ union bpf_attr { * **TCP_SYNCNT**, **TCP_USER_TIMEOUT**, **TCP_NOTSENT_LOWAT**, * **TCP_NODELAY**, **TCP_MAXSEG**, **TCP_WINDOW_CLAMP**, * **TCP_THIN_LINEAR_TIMEOUTS**, **TCP_BPF_DELACK_MAX**, - * **TCP_BPF_RTO_MIN**. + * **TCP_BPF_RTO_MIN**, **TCP_BPF_SOCK_OPS_CB_FLAGS**. * * **IPPROTO_IP**, which supports *optname* **IP_TOS**. * * **IPPROTO_IPV6**, which supports the following *optname*\ s: * **IPV6_TCLASS**, **IPV6_AUTOFLOWLABEL**. 
@@ -7080,6 +7080,7 @@ enum { TCP_BPF_SYN = 1005, /* Copy the TCP header */ TCP_BPF_SYN_IP = 1006, /* Copy the IP[46] and TCP header */ TCP_BPF_SYN_MAC = 1007, /* Copy the MAC, IP[46], and TCP header */ + TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */ }; enum { diff --git a/net/core/filter.c b/net/core/filter.c index f3c72cf86099..d96a50f3f016 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5278,6 +5278,11 @@ static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname, return -EINVAL; inet_csk(sk)->icsk_rto_min = timeout; break; + case TCP_BPF_SOCK_OPS_CB_FLAGS: + if (val & ~(BPF_SOCK_OPS_ALL_CB_FLAGS)) + return -EINVAL; + tp->bpf_sock_ops_cb_flags = val; + break; default: return -EINVAL; } @@ -5366,6 +5371,17 @@ static int sol_tcp_sockopt(struct sock *sk, int optname, if (*optlen < 1) return -EINVAL; break; + case TCP_BPF_SOCK_OPS_CB_FLAGS: + if (*optlen != sizeof(int)) + return -EINVAL; + if (getopt) { + struct tcp_sock *tp = tcp_sk(sk); + int cb_flags = tp->bpf_sock_ops_cb_flags; + + memcpy(optval, &cb_flags, *optlen); + return 0; + } + return bpf_sol_tcp_setsockopt(sk, optname, optval, *optlen); default: if (getopt) return -EINVAL; diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 35bcf52dbc65..e05b39e39c3f 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -2851,7 +2851,7 @@ union bpf_attr { * **TCP_SYNCNT**, **TCP_USER_TIMEOUT**, **TCP_NOTSENT_LOWAT**, * **TCP_NODELAY**, **TCP_MAXSEG**, **TCP_WINDOW_CLAMP**, * **TCP_THIN_LINEAR_TIMEOUTS**, **TCP_BPF_DELACK_MAX**, - * **TCP_BPF_RTO_MIN**. + * **TCP_BPF_RTO_MIN**, **TCP_BPF_SOCK_OPS_CB_FLAGS**. * * **IPPROTO_IP**, which supports *optname* **IP_TOS**. * * **IPPROTO_IPV6**, which supports the following *optname*\ s: * **IPV6_TCLASS**, **IPV6_AUTOFLOWLABEL**. @@ -7080,6 +7080,7 @@ enum { TCP_BPF_SYN = 1005, /* Copy the TCP header */ TCP_BPF_SYN_IP = 1006, /* Copy the IP[46] and TCP header */ TCP_BPF_SYN_MAC = 1007, /* Copy the MAC, IP[46], and TCP header */ + TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */ }; enum { -- cgit v1.2.3 From a1d220d9dafa8d76ba60a784a1016c3134e6a1e8 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Fri, 19 Jul 2024 13:41:52 +0200 Subject: nsfs: iterate through mount namespaces It is already possible to list mounts in other mount namespaces and to retrieve namespace file descriptors without having to go through procfs by deriving them from pidfds. Augment these abilities by adding the ability to retrieve information about a mount namespace via NS_MNT_GET_INFO. This will return the mount namespace id and the number of mounts currently in the mount namespace. The number of mounts can be used to size the buffer that needs to be used for listmount() and is in general useful without having to actually iterate through all the mounts. The structure is extensible. And add the ability to iterate through all mount namespaces over which the caller holds privilege, returning the file descriptor for the next or previous mount namespace. To retrieve a mount namespace, the caller must be privileged with respect to its owning user namespace. This means that PID 1 on the host can list all mounts in all mount namespaces or that a container can list all mounts of its nested containers. Optionally pass a structure for NS_MNT_GET_INFO with NS_MNT_GET_{PREV,NEXT} to retrieve information about the mount namespace in one go. Both ioctls can be implemented for other namespace types easily. 
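For example, a sufficiently privileged task could walk every mount namespace it can see along these lines (a minimal sketch; it assumes the usual unistd/fcntl/sys-ioctl includes plus linux/nsfs.h, and elides error reporting):

  struct mnt_ns_info info = {};
  int fd = open("/proc/self/ns/mnt", O_RDONLY | O_CLOEXEC);

  for (;;) {
  	/* returns an fd for the next mount namespace and fills @info */
  	int next = ioctl(fd, NS_MNT_GET_NEXT, &info);

  	if (next < 0)
  		break;	/* errno == ENOENT: no more namespaces */
  	printf("mnt_ns_id %llu: %u mounts\n",
  	       (unsigned long long)info.mnt_ns_id, info.nr_mounts);
  	close(fd);
  	fd = next;
  }
  close(fd);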
Together with recent api additions this means one can iterate through all mounts in all mount namespaces without ever touching procfs. Link: https://lore.kernel.org/r/20240719-work-mount-namespace-v1-5-834113cab0d2@kernel.org Reviewed-by: Josef Bacik Reviewed-by: Jeff Layton Signed-off-by: Christian Brauner --- fs/mount.h | 13 ++++++ fs/namespace.c | 35 ++++++++++++++-- fs/nsfs.c | 102 +++++++++++++++++++++++++++++++++++++++++++++- include/uapi/linux/nsfs.h | 16 ++++++++ 4 files changed, 160 insertions(+), 6 deletions(-) (limited to 'include/uapi') diff --git a/fs/mount.h b/fs/mount.h index ad4b1ddebb54..c1db0c709c6a 100644 --- a/fs/mount.h +++ b/fs/mount.h @@ -155,3 +155,16 @@ static inline void move_from_ns(struct mount *mnt, struct list_head *dt_list) extern void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor); bool has_locked_children(struct mount *mnt, struct dentry *dentry); +struct mnt_namespace *__lookup_next_mnt_ns(struct mnt_namespace *mnt_ns, bool previous); +static inline struct mnt_namespace *lookup_next_mnt_ns(struct mnt_namespace *mntns) +{ + return __lookup_next_mnt_ns(mntns, false); +} +static inline struct mnt_namespace *lookup_prev_mnt_ns(struct mnt_namespace *mntns) +{ + return __lookup_next_mnt_ns(mntns, true); +} +static inline struct mnt_namespace *to_mnt_ns(struct ns_common *ns) +{ + return container_of(ns, struct mnt_namespace, ns); +} diff --git a/fs/namespace.c b/fs/namespace.c index 1a0b9e3089a2..3dabe49f7376 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2060,14 +2060,41 @@ static bool is_mnt_ns_file(struct dentry *dentry) dentry->d_fsdata == &mntns_operations; } -static struct mnt_namespace *to_mnt_ns(struct ns_common *ns) +struct ns_common *from_mnt_ns(struct mnt_namespace *mnt) { - return container_of(ns, struct mnt_namespace, ns); + return &mnt->ns; } -struct ns_common *from_mnt_ns(struct mnt_namespace *mnt) +struct mnt_namespace *__lookup_next_mnt_ns(struct mnt_namespace *mntns, bool previous) { - return &mnt->ns; + guard(read_lock)(&mnt_ns_tree_lock); + for (;;) { + struct rb_node *node; + + if (previous) + node = rb_prev(&mntns->mnt_ns_tree_node); + else + node = rb_next(&mntns->mnt_ns_tree_node); + if (!node) + return ERR_PTR(-ENOENT); + + mntns = node_to_mnt_ns(node); + node = &mntns->mnt_ns_tree_node; + + if (!ns_capable_noaudit(mntns->user_ns, CAP_SYS_ADMIN)) + continue; + + /* + * Holding mnt_ns_tree_lock prevents the mount namespace from + * being freed but it may well be on its deathbed. We want an + * active reference, not just a passive one here as we're + * persisting the mount namespace. + */ + if (!refcount_inc_not_zero(&mntns->ns.count)) + continue; + + return mntns; + } } static bool mnt_ns_loop(struct dentry *dentry) diff --git a/fs/nsfs.c b/fs/nsfs.c index 97c37a9631e5..67ee176b8824 100644 --- a/fs/nsfs.c +++ b/fs/nsfs.c @@ -12,6 +12,7 @@ #include #include #include +#include #include "mount.h" #include "internal.h" @@ -128,6 +129,30 @@ int open_related_ns(struct ns_common *ns, } EXPORT_SYMBOL_GPL(open_related_ns); +static int copy_ns_info_to_user(const struct mnt_namespace *mnt_ns, + struct mnt_ns_info __user *uinfo, size_t usize, + struct mnt_ns_info *kinfo) +{ + /* + * If userspace and the kernel have the same struct size it can just + * be copied. If userspace provides an older struct, only the bits that + * userspace knows about will be copied. If userspace provides a new + * struct, only the bits that the kernel knows about will be copied and + * the size value will be set to the size the kernel knows about. 
+ */ + kinfo->size = min(usize, sizeof(*kinfo)); + kinfo->mnt_ns_id = mnt_ns->seq; + kinfo->nr_mounts = READ_ONCE(mnt_ns->nr_mounts); + /* Subtract the root mount of the mount namespace. */ + if (kinfo->nr_mounts) + kinfo->nr_mounts--; + + if (copy_to_user(uinfo, kinfo, kinfo->size)) + return -EFAULT; + + return 0; +} + static long ns_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) { @@ -135,6 +160,8 @@ static long ns_ioctl(struct file *filp, unsigned int ioctl, struct pid_namespace *pid_ns; struct task_struct *tsk; struct ns_common *ns = get_proc_ns(file_inode(filp)); + struct mnt_namespace *mnt_ns; + bool previous = false; uid_t __user *argp; uid_t uid; int ret; @@ -156,7 +183,6 @@ static long ns_ioctl(struct file *filp, unsigned int ioctl, uid = from_kuid_munged(current_user_ns(), user_ns->owner); return put_user(uid, argp); case NS_GET_MNTNS_ID: { - struct mnt_namespace *mnt_ns; __u64 __user *idp; __u64 id; @@ -211,7 +237,79 @@ static long ns_ioctl(struct file *filp, unsigned int ioctl, if (!ret) ret = -ESRCH; - break; + return ret; + } + } + + /* extensible ioctls */ + switch (_IOC_NR(ioctl)) { + case _IOC_NR(NS_MNT_GET_INFO): { + struct mnt_ns_info kinfo = {}; + struct mnt_ns_info __user *uinfo = (struct mnt_ns_info __user *)arg; + size_t usize = _IOC_SIZE(ioctl); + + if (ns->ops->type != CLONE_NEWNS) + return -EINVAL; + + if (!uinfo) + return -EINVAL; + + if (usize < MNT_NS_INFO_SIZE_VER0) + return -EINVAL; + + return copy_ns_info_to_user(to_mnt_ns(ns), uinfo, usize, &kinfo); + } + case _IOC_NR(NS_MNT_GET_PREV): + previous = true; + fallthrough; + case _IOC_NR(NS_MNT_GET_NEXT): { + struct mnt_ns_info kinfo = {}; + struct mnt_ns_info __user *uinfo = (struct mnt_ns_info __user *)arg; + struct path path __free(path_put) = {}; + struct file *f __free(fput) = NULL; + size_t usize = _IOC_SIZE(ioctl); + + if (ns->ops->type != CLONE_NEWNS) + return -EINVAL; + + if (usize < MNT_NS_INFO_SIZE_VER0) + return -EINVAL; + + if (previous) + mnt_ns = lookup_prev_mnt_ns(to_mnt_ns(ns)); + else + mnt_ns = lookup_next_mnt_ns(to_mnt_ns(ns)); + if (IS_ERR(mnt_ns)) + return PTR_ERR(mnt_ns); + + ns = to_ns_common(mnt_ns); + /* Transfer ownership of @mnt_ns reference to @path. */ + ret = path_from_stashed(&ns->stashed, nsfs_mnt, ns, &path); + if (ret) + return ret; + + CLASS(get_unused_fd, fd)(O_CLOEXEC); + if (fd < 0) + return fd; + + f = dentry_open(&path, O_RDONLY, current_cred()); + if (IS_ERR(f)) + return PTR_ERR(f); + + if (uinfo) { + /* + * If @uinfo is passed return all information about the + * mount namespace as well. + */ + ret = copy_ns_info_to_user(to_mnt_ns(ns), uinfo, usize, &kinfo); + if (ret) + return ret; + } + + /* Transfer reference of @f to caller's fdtable. */ + fd_install(fd, no_free_ptr(f)); + /* File descriptor is live so hand it off to the caller. */ + return take_fd(fd); } default: ret = -ENOTTY; diff --git a/include/uapi/linux/nsfs.h b/include/uapi/linux/nsfs.h index b133211331f6..cb31af3c1840 100644 --- a/include/uapi/linux/nsfs.h +++ b/include/uapi/linux/nsfs.h @@ -3,6 +3,7 @@ #define __LINUX_NSFS_H #include +#include #define NSIO 0xb7 @@ -26,4 +27,19 @@ /* Return thread-group leader id of pid in the target pid namespace. */ #define NS_GET_TGID_IN_PIDNS _IOR(NSIO, 0x9, int) +struct mnt_ns_info { + __u32 size; + __u32 nr_mounts; + __u64 mnt_ns_id; +}; + +#define MNT_NS_INFO_SIZE_VER0 16 /* size of first published struct */ + +/* Get information about namespace. */ +#define NS_MNT_GET_INFO _IOR(NSIO, 10, struct mnt_ns_info) +/* Get next namespace. 
*/ +#define NS_MNT_GET_NEXT _IOR(NSIO, 11, struct mnt_ns_info) /* Get previous namespace. */ +#define NS_MNT_GET_PREV _IOR(NSIO, 12, struct mnt_ns_info) + #endif /* __LINUX_NSFS_H */ -- cgit v1.2.3 From de8f847a5114ff7cfcdfc114af8485c431dec703 Mon Sep 17 00:00:00 2001 From: Yishai Hadas Date: Thu, 1 Aug 2024 15:05:16 +0300 Subject: RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with the Data-direct device. When userspace registers a DMABUF MR with the data direct bit set, the algorithm below is followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. 
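For illustration, a userspace registration could then look roughly like the sketch below. Note that the helper name mlx5dv_reg_dmabuf_mr() and the MLX5DV_REG_DMABUF_ACCESS_DATA_DIRECT flag are assumed rdma-core counterparts of the new uapi attribute and are not defined by this kernel patch:

  /* Hypothetical rdma-core style call; per the notes above, the IOVA
   * must be system page aligned and ODP access is not supported.
   */
  struct ibv_mr *mr = mlx5dv_reg_dmabuf_mr(pd, offset, length, iova,
  					   dmabuf_fd,
  					   IBV_ACCESS_LOCAL_WRITE,
  					   MLX5DV_REG_DMABUF_ACCESS_DATA_DIRECT);
  if (!mr)
  	/* e.g. no data direct device is currently affiliated */
  	err = errno;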
Signed-off-by: Yishai Hadas Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky --- drivers/infiniband/hw/mlx5/main.c | 11 ++ drivers/infiniband/hw/mlx5/mlx5_ib.h | 8 + drivers/infiniband/hw/mlx5/mr.c | 304 ++++++++++++++++++++++++++---- drivers/infiniband/hw/mlx5/odp.c | 5 +- drivers/infiniband/hw/mlx5/umr.c | 93 ++++++--- drivers/infiniband/hw/mlx5/umr.h | 1 + include/uapi/rdma/mlx5_user_ioctl_cmds.h | 4 + include/uapi/rdma/mlx5_user_ioctl_verbs.h | 4 + 8 files changed, 358 insertions(+), 72 deletions(-) (limited to 'include/uapi') diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c index fc0562f07249..b85ad3c0bfa1 100644 --- a/drivers/infiniband/hw/mlx5/main.c +++ b/drivers/infiniband/hw/mlx5/main.c @@ -3490,6 +3490,7 @@ static int mlx5_ib_data_direct_init(struct mlx5_ib_dev *dev) if (ret) return ret; + INIT_LIST_HEAD(&dev->data_direct_mr_list); ret = mlx5_data_direct_ib_reg(dev, vuid); if (ret) mlx5_ib_free_data_direct_resources(dev); @@ -3882,6 +3883,14 @@ ADD_UVERBS_ATTRIBUTES_SIMPLE( dump_fill_mkey), UA_MANDATORY)); +ADD_UVERBS_ATTRIBUTES_SIMPLE( + mlx5_ib_reg_dmabuf_mr, + UVERBS_OBJECT_MR, + UVERBS_METHOD_REG_DMABUF_MR, + UVERBS_ATTR_FLAGS_IN(MLX5_IB_ATTR_REG_DMABUF_MR_ACCESS_FLAGS, + enum mlx5_ib_uapi_reg_dmabuf_flags, + UA_OPTIONAL)); + static const struct uapi_definition mlx5_ib_defs[] = { UAPI_DEF_CHAIN(mlx5_ib_devx_defs), UAPI_DEF_CHAIN(mlx5_ib_flow_defs), @@ -3891,6 +3900,7 @@ static const struct uapi_definition mlx5_ib_defs[] = { UAPI_DEF_CHAIN(mlx5_ib_create_cq_defs), UAPI_DEF_CHAIN_OBJ_TREE(UVERBS_OBJECT_DEVICE, &mlx5_ib_query_context), + UAPI_DEF_CHAIN_OBJ_TREE(UVERBS_OBJECT_MR, &mlx5_ib_reg_dmabuf_mr), UAPI_DEF_CHAIN_OBJ_TREE_NAMED(MLX5_IB_OBJECT_VAR, UAPI_DEF_IS_OBJ_SUPPORTED(var_is_supported)), UAPI_DEF_CHAIN_OBJ_TREE_NAMED(MLX5_IB_OBJECT_UAR), @@ -4396,6 +4406,7 @@ void mlx5_ib_data_direct_bind(struct mlx5_ib_dev *ibdev, void mlx5_ib_data_direct_unbind(struct mlx5_ib_dev *ibdev) { mutex_lock(&ibdev->data_direct_lock); + mlx5_ib_revoke_data_direct_mrs(ibdev); ibdev->data_direct_dev = NULL; mutex_unlock(&ibdev->data_direct_lock); } diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h index e915a62da49c..be83a4d91a34 100644 --- a/drivers/infiniband/hw/mlx5/mlx5_ib.h +++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h @@ -682,6 +682,8 @@ struct mlx5_ib_mr { struct mlx5_ib_mkey mmkey; struct ib_umem *umem; + /* The mr is data direct related */ + u8 data_direct :1; union { /* Used only by kernel MRs (umem == NULL) */ @@ -719,6 +721,10 @@ struct mlx5_ib_mr { } odp_destroy; struct ib_odp_counters odp_stats; bool is_odp_implicit; + /* The affiliated data direct crossed mr */ + struct mlx5_ib_mr *dd_crossed_mr; + struct list_head dd_node; + u8 revoked :1; }; }; }; @@ -1169,6 +1175,7 @@ struct mlx5_ib_dev { /* protect resources needed as part of reset flow */ spinlock_t reset_flow_resource_lock; struct list_head qp_list; + struct list_head data_direct_mr_list; /* Array with num_ports elements */ struct mlx5_ib_port *port; struct mlx5_sq_bfreg bfreg; @@ -1437,6 +1444,7 @@ struct ib_mr *mlx5_ib_reg_dm_mr(struct ib_pd *pd, struct ib_dm *dm, void mlx5_ib_data_direct_bind(struct mlx5_ib_dev *ibdev, struct mlx5_data_direct_dev *dev); void mlx5_ib_data_direct_unbind(struct mlx5_ib_dev *ibdev); +void mlx5_ib_revoke_data_direct_mrs(struct mlx5_ib_dev *dev); #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING int mlx5_ib_odp_init_one(struct mlx5_ib_dev *ibdev); 
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c index 1dfd9124bdd1..6829e3688b60 100644 --- a/drivers/infiniband/hw/mlx5/mr.c +++ b/drivers/infiniband/hw/mlx5/mr.c @@ -43,6 +43,7 @@ #include "dm.h" #include "mlx5_ib.h" #include "umr.h" +#include "data_direct.h" enum { MAX_PENDING_REG_MR = 8, @@ -54,7 +55,9 @@ static void create_mkey_callback(int status, struct mlx5_async_work *context); static struct mlx5_ib_mr *reg_create(struct ib_pd *pd, struct ib_umem *umem, u64 iova, int access_flags, - unsigned int page_size, bool populate); + unsigned int page_size, bool populate, + int access_mode); +static int __mlx5_ib_dereg_mr(struct ib_mr *ibmr); static void set_mkc_access_pd_addr_fields(void *mkc, int acc, u64 start_addr, struct ib_pd *pd) @@ -1126,12 +1129,10 @@ static unsigned int mlx5_umem_dmabuf_default_pgsz(struct ib_umem *umem, static struct mlx5_ib_mr *alloc_cacheable_mr(struct ib_pd *pd, struct ib_umem *umem, u64 iova, - int access_flags) + int access_flags, int access_mode) { - struct mlx5r_cache_rb_key rb_key = { - .access_mode = MLX5_MKC_ACCESS_MODE_MTT, - }; struct mlx5_ib_dev *dev = to_mdev(pd->device); + struct mlx5r_cache_rb_key rb_key = {}; struct mlx5_cache_ent *ent; struct mlx5_ib_mr *mr; unsigned int page_size; @@ -1144,6 +1145,7 @@ static struct mlx5_ib_mr *alloc_cacheable_mr(struct ib_pd *pd, if (WARN_ON(!page_size)) return ERR_PTR(-EINVAL); + rb_key.access_mode = access_mode; rb_key.ndescs = ib_umem_num_dma_blocks(umem, page_size); rb_key.ats = mlx5_umem_needs_ats(dev, umem, access_flags); rb_key.access_flags = get_unchangeable_access_flags(dev, access_flags); @@ -1154,7 +1156,7 @@ static struct mlx5_ib_mr *alloc_cacheable_mr(struct ib_pd *pd, */ if (!ent) { mutex_lock(&dev->slow_path_mutex); - mr = reg_create(pd, umem, iova, access_flags, page_size, false); + mr = reg_create(pd, umem, iova, access_flags, page_size, false, access_mode); mutex_unlock(&dev->slow_path_mutex); if (IS_ERR(mr)) return mr; @@ -1175,13 +1177,71 @@ static struct mlx5_ib_mr *alloc_cacheable_mr(struct ib_pd *pd, return mr; } +static struct ib_mr * +reg_create_crossing_vhca_mr(struct ib_pd *pd, u64 iova, u64 length, int access_flags, + u32 crossed_lkey) +{ + struct mlx5_ib_dev *dev = to_mdev(pd->device); + int access_mode = MLX5_MKC_ACCESS_MODE_CROSSING; + struct mlx5_ib_mr *mr; + void *mkc; + int inlen; + u32 *in; + int err; + + if (!MLX5_CAP_GEN(dev->mdev, crossing_vhca_mkey)) + return ERR_PTR(-EOPNOTSUPP); + + mr = kzalloc(sizeof(*mr), GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + inlen = MLX5_ST_SZ_BYTES(create_mkey_in); + in = kvzalloc(inlen, GFP_KERNEL); + if (!in) { + err = -ENOMEM; + goto err_1; + } + + mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry); + MLX5_SET(mkc, mkc, crossing_target_vhca_id, + MLX5_CAP_GEN(dev->mdev, vhca_id)); + MLX5_SET(mkc, mkc, translations_octword_size, crossed_lkey); + MLX5_SET(mkc, mkc, access_mode_1_0, access_mode & 0x3); + MLX5_SET(mkc, mkc, access_mode_4_2, (access_mode >> 2) & 0x7); + + /* for this crossing mkey IOVA should be 0 and len should be IOVA + len */ + set_mkc_access_pd_addr_fields(mkc, access_flags, 0, pd); + MLX5_SET64(mkc, mkc, len, iova + length); + + MLX5_SET(mkc, mkc, free, 0); + MLX5_SET(mkc, mkc, umr_en, 0); + err = mlx5_ib_create_mkey(dev, &mr->mmkey, in, inlen); + if (err) + goto err_2; + + mr->mmkey.type = MLX5_MKEY_MR; + set_mr_fields(dev, mr, length, access_flags, iova); + mr->ibmr.pd = pd; + kvfree(in); + mlx5_ib_dbg(dev, "crossing mkey = 0x%x\n", mr->mmkey.key); + + return 
&mr->ibmr; +err_2: + kvfree(in); +err_1: + kfree(mr); + return ERR_PTR(err); +} + /* * If ibmr is NULL it will be allocated by reg_create. * Else, the given ibmr will be used. */ static struct mlx5_ib_mr *reg_create(struct ib_pd *pd, struct ib_umem *umem, u64 iova, int access_flags, - unsigned int page_size, bool populate) + unsigned int page_size, bool populate, + int access_mode) { struct mlx5_ib_dev *dev = to_mdev(pd->device); struct mlx5_ib_mr *mr; @@ -1190,7 +1250,9 @@ static struct mlx5_ib_mr *reg_create(struct ib_pd *pd, struct ib_umem *umem, int inlen; u32 *in; int err; - bool pg_cap = !!(MLX5_CAP_GEN(dev->mdev, pg)); + bool pg_cap = !!(MLX5_CAP_GEN(dev->mdev, pg)) && + (access_mode == MLX5_MKC_ACCESS_MODE_MTT); + bool ksm_mode = (access_mode == MLX5_MKC_ACCESS_MODE_KSM); if (!page_size) return ERR_PTR(-EINVAL); @@ -1213,7 +1275,7 @@ static struct mlx5_ib_mr *reg_create(struct ib_pd *pd, struct ib_umem *umem, } pas = (__be64 *)MLX5_ADDR_OF(create_mkey_in, in, klm_pas_mtt); if (populate) { - if (WARN_ON(access_flags & IB_ACCESS_ON_DEMAND)) { + if (WARN_ON(access_flags & IB_ACCESS_ON_DEMAND || ksm_mode)) { err = -EINVAL; goto err_2; } @@ -1229,14 +1291,22 @@ static struct mlx5_ib_mr *reg_create(struct ib_pd *pd, struct ib_umem *umem, mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry); set_mkc_access_pd_addr_fields(mkc, access_flags, iova, populate ? pd : dev->umrc.pd); + /* In case a data direct flow, overwrite the pdn field by its internal kernel PD */ + if (umem->is_dmabuf && ksm_mode) + MLX5_SET(mkc, mkc, pd, dev->ddr.pdn); + MLX5_SET(mkc, mkc, free, !populate); - MLX5_SET(mkc, mkc, access_mode_1_0, MLX5_MKC_ACCESS_MODE_MTT); + MLX5_SET(mkc, mkc, access_mode_1_0, access_mode); MLX5_SET(mkc, mkc, umr_en, 1); MLX5_SET64(mkc, mkc, len, umem->length); MLX5_SET(mkc, mkc, bsf_octword_size, 0); - MLX5_SET(mkc, mkc, translations_octword_size, - get_octo_len(iova, umem->length, mr->page_shift)); + if (ksm_mode) + MLX5_SET(mkc, mkc, translations_octword_size, + get_octo_len(iova, umem->length, mr->page_shift) * 2); + else + MLX5_SET(mkc, mkc, translations_octword_size, + get_octo_len(iova, umem->length, mr->page_shift)); MLX5_SET(mkc, mkc, log_page_size, mr->page_shift); if (mlx5_umem_needs_ats(dev, umem, access_flags)) MLX5_SET(mkc, mkc, ma_translation_mode, 1); @@ -1373,13 +1443,15 @@ static struct ib_mr *create_real_mr(struct ib_pd *pd, struct ib_umem *umem, xlt_with_umr = mlx5r_umr_can_load_pas(dev, umem->length); if (xlt_with_umr) { - mr = alloc_cacheable_mr(pd, umem, iova, access_flags); + mr = alloc_cacheable_mr(pd, umem, iova, access_flags, + MLX5_MKC_ACCESS_MODE_MTT); } else { unsigned int page_size = mlx5_umem_find_best_pgsz( umem, mkc, log_page_size, 0, iova); mutex_lock(&dev->slow_path_mutex); - mr = reg_create(pd, umem, iova, access_flags, page_size, true); + mr = reg_create(pd, umem, iova, access_flags, page_size, + true, MLX5_MKC_ACCESS_MODE_MTT); mutex_unlock(&dev->slow_path_mutex); } if (IS_ERR(mr)) { @@ -1442,7 +1514,8 @@ static struct ib_mr *create_user_odp_mr(struct ib_pd *pd, u64 start, u64 length, if (IS_ERR(odp)) return ERR_CAST(odp); - mr = alloc_cacheable_mr(pd, &odp->umem, iova, access_flags); + mr = alloc_cacheable_mr(pd, &odp->umem, iova, access_flags, + MLX5_MKC_ACCESS_MODE_MTT); if (IS_ERR(mr)) { ib_umem_release(&odp->umem); return ERR_CAST(mr); @@ -1510,35 +1583,31 @@ static struct dma_buf_attach_ops mlx5_ib_dmabuf_attach_ops = { .move_notify = mlx5_ib_dmabuf_invalidate_cb, }; -struct ib_mr *mlx5_ib_reg_user_mr_dmabuf(struct ib_pd *pd, u64 offset, - 
u64 length, u64 virt_addr, - int fd, int access_flags, - struct uverbs_attr_bundle *attrs) +static struct ib_mr * +reg_user_mr_dmabuf(struct ib_pd *pd, struct device *dma_device, + u64 offset, u64 length, u64 virt_addr, + int fd, int access_flags, int access_mode) { + bool pinned_mode = (access_mode == MLX5_MKC_ACCESS_MODE_KSM); struct mlx5_ib_dev *dev = to_mdev(pd->device); struct mlx5_ib_mr *mr = NULL; struct ib_umem_dmabuf *umem_dmabuf; int err; - if (!IS_ENABLED(CONFIG_INFINIBAND_USER_MEM) || - !IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)) - return ERR_PTR(-EOPNOTSUPP); - - mlx5_ib_dbg(dev, - "offset 0x%llx, virt_addr 0x%llx, length 0x%llx, fd %d, access_flags 0x%x\n", - offset, virt_addr, length, fd, access_flags); - err = mlx5r_umr_resource_init(dev); if (err) return ERR_PTR(err); - /* dmabuf requires xlt update via umr to work. */ - if (!mlx5r_umr_can_load_pas(dev, length)) - return ERR_PTR(-EINVAL); + if (!pinned_mode) + umem_dmabuf = ib_umem_dmabuf_get(&dev->ib_dev, + offset, length, fd, + access_flags, + &mlx5_ib_dmabuf_attach_ops); + else + umem_dmabuf = ib_umem_dmabuf_get_pinned_with_dma_device(&dev->ib_dev, + dma_device, offset, length, + fd, access_flags); - umem_dmabuf = ib_umem_dmabuf_get(&dev->ib_dev, offset, length, fd, - access_flags, - &mlx5_ib_dmabuf_attach_ops); if (IS_ERR(umem_dmabuf)) { mlx5_ib_dbg(dev, "umem_dmabuf get failed (%ld)\n", PTR_ERR(umem_dmabuf)); @@ -1546,7 +1615,7 @@ struct ib_mr *mlx5_ib_reg_user_mr_dmabuf(struct ib_pd *pd, u64 offset, } mr = alloc_cacheable_mr(pd, &umem_dmabuf->umem, virt_addr, - access_flags); + access_flags, access_mode); if (IS_ERR(mr)) { ib_umem_release(&umem_dmabuf->umem); return ERR_CAST(mr); @@ -1556,9 +1625,13 @@ struct ib_mr *mlx5_ib_reg_user_mr_dmabuf(struct ib_pd *pd, u64 offset, atomic_add(ib_umem_num_pages(mr->umem), &dev->mdev->priv.reg_pages); umem_dmabuf->private = mr; - err = mlx5r_store_odp_mkey(dev, &mr->mmkey); - if (err) - goto err_dereg_mr; + if (!pinned_mode) { + err = mlx5r_store_odp_mkey(dev, &mr->mmkey); + if (err) + goto err_dereg_mr; + } else { + mr->data_direct = true; + } err = mlx5_ib_init_dmabuf_mr(mr); if (err) @@ -1566,10 +1639,101 @@ struct ib_mr *mlx5_ib_reg_user_mr_dmabuf(struct ib_pd *pd, u64 offset, return &mr->ibmr; err_dereg_mr: - mlx5_ib_dereg_mr(&mr->ibmr, NULL); + __mlx5_ib_dereg_mr(&mr->ibmr); return ERR_PTR(err); } +static struct ib_mr * +reg_user_mr_dmabuf_by_data_direct(struct ib_pd *pd, u64 offset, + u64 length, u64 virt_addr, + int fd, int access_flags) +{ + struct mlx5_ib_dev *dev = to_mdev(pd->device); + struct mlx5_data_direct_dev *data_direct_dev; + struct ib_mr *crossing_mr; + struct ib_mr *crossed_mr; + int ret = 0; + + /* As of HW behaviour the IOVA must be page aligned in KSM mode */ + if (!PAGE_ALIGNED(virt_addr) || (access_flags & IB_ACCESS_ON_DEMAND)) + return ERR_PTR(-EOPNOTSUPP); + + mutex_lock(&dev->data_direct_lock); + data_direct_dev = dev->data_direct_dev; + if (!data_direct_dev) { + ret = -EINVAL; + goto end; + } + + /* The device's 'data direct mkey' was created without RO flags to + * simplify things and allow for a single mkey per device. + * Since RO is not a must, mask it out accordingly. 
+ */ + access_flags &= ~IB_ACCESS_RELAXED_ORDERING; + crossed_mr = reg_user_mr_dmabuf(pd, &data_direct_dev->pdev->dev, + offset, length, virt_addr, fd, + access_flags, MLX5_MKC_ACCESS_MODE_KSM); + if (IS_ERR(crossed_mr)) { + ret = PTR_ERR(crossed_mr); + goto end; + } + + mutex_lock(&dev->slow_path_mutex); + crossing_mr = reg_create_crossing_vhca_mr(pd, virt_addr, length, access_flags, + crossed_mr->lkey); + mutex_unlock(&dev->slow_path_mutex); + if (IS_ERR(crossing_mr)) { + __mlx5_ib_dereg_mr(crossed_mr); + ret = PTR_ERR(crossing_mr); + goto end; + } + + list_add_tail(&to_mmr(crossed_mr)->dd_node, &dev->data_direct_mr_list); + to_mmr(crossing_mr)->dd_crossed_mr = to_mmr(crossed_mr); + to_mmr(crossing_mr)->data_direct = true; +end: + mutex_unlock(&dev->data_direct_lock); + return ret ? ERR_PTR(ret) : crossing_mr; +} + +struct ib_mr *mlx5_ib_reg_user_mr_dmabuf(struct ib_pd *pd, u64 offset, + u64 length, u64 virt_addr, + int fd, int access_flags, + struct uverbs_attr_bundle *attrs) +{ + struct mlx5_ib_dev *dev = to_mdev(pd->device); + int mlx5_access_flags = 0; + int err; + + if (!IS_ENABLED(CONFIG_INFINIBAND_USER_MEM) || + !IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)) + return ERR_PTR(-EOPNOTSUPP); + + if (uverbs_attr_is_valid(attrs, MLX5_IB_ATTR_REG_DMABUF_MR_ACCESS_FLAGS)) { + err = uverbs_get_flags32(&mlx5_access_flags, attrs, + MLX5_IB_ATTR_REG_DMABUF_MR_ACCESS_FLAGS, + MLX5_IB_UAPI_REG_DMABUF_ACCESS_DATA_DIRECT); + if (err) + return ERR_PTR(err); + } + + mlx5_ib_dbg(dev, + "offset 0x%llx, virt_addr 0x%llx, length 0x%llx, fd %d, access_flags 0x%x, mlx5_access_flags 0x%x\n", + offset, virt_addr, length, fd, access_flags, mlx5_access_flags); + + /* dmabuf requires xlt update via umr to work. */ + if (!mlx5r_umr_can_load_pas(dev, length)) + return ERR_PTR(-EINVAL); + + if (mlx5_access_flags & MLX5_IB_UAPI_REG_DMABUF_ACCESS_DATA_DIRECT) + return reg_user_mr_dmabuf_by_data_direct(pd, offset, length, virt_addr, + fd, access_flags); + + return reg_user_mr_dmabuf(pd, pd->device->dma_device, + offset, length, virt_addr, + fd, access_flags, MLX5_MKC_ACCESS_MODE_MTT); +} + /* * True if the change in access flags can be done via UMR, only some access * flags can be updated. 
@@ -1665,7 +1829,7 @@ struct ib_mr *mlx5_ib_rereg_user_mr(struct ib_mr *ib_mr, int flags, u64 start, struct mlx5_ib_mr *mr = to_mmr(ib_mr); int err; - if (!IS_ENABLED(CONFIG_INFINIBAND_USER_MEM)) + if (!IS_ENABLED(CONFIG_INFINIBAND_USER_MEM) || mr->data_direct) return ERR_PTR(-EOPNOTSUPP); mlx5_ib_dbg( @@ -1793,7 +1957,7 @@ err: static void mlx5_free_priv_descs(struct mlx5_ib_mr *mr) { - if (!mr->umem && mr->descs) { + if (!mr->umem && !mr->data_direct && mr->descs) { struct ib_device *device = mr->ibmr.device; int size = mr->max_descs * mr->desc_size; struct mlx5_ib_dev *dev = to_mdev(device); @@ -1847,6 +2011,34 @@ end: return ret; } +static int mlx5_ib_revoke_data_direct_mr(struct mlx5_ib_mr *mr) +{ + struct mlx5_ib_dev *dev = to_mdev(mr->ibmr.device); + struct ib_umem_dmabuf *umem_dmabuf = to_ib_umem_dmabuf(mr->umem); + int err; + + lockdep_assert_held(&dev->data_direct_lock); + mr->revoked = true; + err = mlx5r_umr_revoke_mr(mr); + if (WARN_ON(err)) + return err; + + ib_umem_dmabuf_revoke(umem_dmabuf); + return 0; +} + +void mlx5_ib_revoke_data_direct_mrs(struct mlx5_ib_dev *dev) +{ + struct mlx5_ib_mr *mr, *next; + + lockdep_assert_held(&dev->data_direct_lock); + + list_for_each_entry_safe(mr, next, &dev->data_direct_mr_list, dd_node) { + list_del(&mr->dd_node); + mlx5_ib_revoke_data_direct_mr(mr); + } +} + static int mlx5_revoke_mr(struct mlx5_ib_mr *mr) { struct mlx5_ib_dev *dev = to_mdev(mr->ibmr.device); @@ -1864,7 +2056,7 @@ static int mlx5_revoke_mr(struct mlx5_ib_mr *mr) return destroy_mkey(dev, mr); } -int mlx5_ib_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata) +static int __mlx5_ib_dereg_mr(struct ib_mr *ibmr) { struct mlx5_ib_mr *mr = to_mmr(ibmr); struct mlx5_ib_dev *dev = to_mdev(ibmr->device); @@ -1931,6 +2123,36 @@ int mlx5_ib_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata) return 0; } +static int dereg_crossing_data_direct_mr(struct mlx5_ib_dev *dev, + struct mlx5_ib_mr *mr) +{ + struct mlx5_ib_mr *dd_crossed_mr = mr->dd_crossed_mr; + int ret; + + ret = __mlx5_ib_dereg_mr(&mr->ibmr); + if (ret) + return ret; + + mutex_lock(&dev->data_direct_lock); + if (!dd_crossed_mr->revoked) + list_del(&dd_crossed_mr->dd_node); + + ret = __mlx5_ib_dereg_mr(&dd_crossed_mr->ibmr); + mutex_unlock(&dev->data_direct_lock); + return ret; +} + +int mlx5_ib_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata) +{ + struct mlx5_ib_mr *mr = to_mmr(ibmr); + struct mlx5_ib_dev *dev = to_mdev(ibmr->device); + + if (mr->data_direct) + return dereg_crossing_data_direct_mr(dev, mr); + + return __mlx5_ib_dereg_mr(ibmr); +} + static void mlx5_set_umr_free_mkey(struct ib_pd *pd, u32 *in, int ndescs, int access_mode, int page_shift) { diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c index a524181f34df..44a3428ea342 100644 --- a/drivers/infiniband/hw/mlx5/odp.c +++ b/drivers/infiniband/hw/mlx5/odp.c @@ -710,7 +710,10 @@ static int pagefault_dmabuf_mr(struct mlx5_ib_mr *mr, size_t bcnt, ib_umem_dmabuf_unmap_pages(umem_dmabuf); err = -EINVAL; } else { - err = mlx5r_umr_update_mr_pas(mr, xlt_flags); + if (mr->data_direct) + err = mlx5r_umr_update_data_direct_ksm_pas(mr, xlt_flags); + else + err = mlx5r_umr_update_mr_pas(mr, xlt_flags); } dma_resv_unlock(umem_dmabuf->attach->dmabuf->resv); diff --git a/drivers/infiniband/hw/mlx5/umr.c b/drivers/infiniband/hw/mlx5/umr.c index ffc31b01f690..eb74c163fd83 100644 --- a/drivers/infiniband/hw/mlx5/umr.c +++ b/drivers/infiniband/hw/mlx5/umr.c @@ -632,44 +632,47 @@ static void mlx5r_umr_final_update_xlt(struct mlx5_ib_dev 
*dev, wqe->data_seg.byte_count = cpu_to_be32(sg->length); } -/* - * Send the DMA list to the HW for a normal MR using UMR. - * Dmabuf MR is handled in a similar way, except that the MLX5_IB_UPD_XLT_ZAP - * flag may be used. - */ -int mlx5r_umr_update_mr_pas(struct mlx5_ib_mr *mr, unsigned int flags) +static int +_mlx5r_umr_update_mr_pas(struct mlx5_ib_mr *mr, unsigned int flags, bool dd) { + size_t ent_size = dd ? sizeof(struct mlx5_ksm) : sizeof(struct mlx5_mtt); struct mlx5_ib_dev *dev = mr_to_mdev(mr); struct device *ddev = &dev->mdev->pdev->dev; struct mlx5r_umr_wqe wqe = {}; struct ib_block_iter biter; + struct mlx5_ksm *cur_ksm; struct mlx5_mtt *cur_mtt; size_t orig_sg_length; - struct mlx5_mtt *mtt; size_t final_size; + void *curr_entry; struct ib_sge sg; + void *entry; u64 offset = 0; int err = 0; - if (WARN_ON(mr->umem->is_odp)) - return -EINVAL; - - mtt = mlx5r_umr_create_xlt( - dev, &sg, ib_umem_num_dma_blocks(mr->umem, 1 << mr->page_shift), - sizeof(*mtt), flags); - if (!mtt) + entry = mlx5r_umr_create_xlt(dev, &sg, + ib_umem_num_dma_blocks(mr->umem, 1 << mr->page_shift), + ent_size, flags); + if (!entry) return -ENOMEM; orig_sg_length = sg.length; - mlx5r_umr_set_update_xlt_ctrl_seg(&wqe.ctrl_seg, flags, &sg); mlx5r_umr_set_update_xlt_mkey_seg(dev, &wqe.mkey_seg, mr, mr->page_shift); + if (dd) { + /* Use the data direct internal kernel PD */ + MLX5_SET(mkc, &wqe.mkey_seg, pd, dev->ddr.pdn); + cur_ksm = entry; + } else { + cur_mtt = entry; + } + mlx5r_umr_set_update_xlt_data_seg(&wqe.data_seg, &sg); - cur_mtt = mtt; + curr_entry = entry; rdma_umem_for_each_dma_block(mr->umem, &biter, BIT(mr->page_shift)) { - if (cur_mtt == (void *)mtt + sg.length) { + if (curr_entry == entry + sg.length) { dma_sync_single_for_device(ddev, sg.addr, sg.length, DMA_TO_DEVICE); @@ -681,23 +684,31 @@ int mlx5r_umr_update_mr_pas(struct mlx5_ib_mr *mr, unsigned int flags) DMA_TO_DEVICE); offset += sg.length; mlx5r_umr_update_offset(&wqe.ctrl_seg, offset); - - cur_mtt = mtt; + if (dd) + cur_ksm = entry; + else + cur_mtt = entry; } - cur_mtt->ptag = - cpu_to_be64(rdma_block_iter_dma_address(&biter) | - MLX5_IB_MTT_PRESENT); - - if (mr->umem->is_dmabuf && (flags & MLX5_IB_UPD_XLT_ZAP)) - cur_mtt->ptag = 0; - - cur_mtt++; + if (dd) { + cur_ksm->va = cpu_to_be64(rdma_block_iter_dma_address(&biter)); + cur_ksm->key = cpu_to_be32(dev->ddr.mkey); + cur_ksm++; + curr_entry = cur_ksm; + } else { + cur_mtt->ptag = + cpu_to_be64(rdma_block_iter_dma_address(&biter) | + MLX5_IB_MTT_PRESENT); + if (mr->umem->is_dmabuf && (flags & MLX5_IB_UPD_XLT_ZAP)) + cur_mtt->ptag = 0; + cur_mtt++; + curr_entry = cur_mtt; + } } - final_size = (void *)cur_mtt - (void *)mtt; + final_size = curr_entry - entry; sg.length = ALIGN(final_size, MLX5_UMR_FLEX_ALIGNMENT); - memset(cur_mtt, 0, sg.length - final_size); + memset(curr_entry, 0, sg.length - final_size); mlx5r_umr_final_update_xlt(dev, &wqe, mr, &sg, flags); dma_sync_single_for_device(ddev, sg.addr, sg.length, DMA_TO_DEVICE); @@ -705,10 +716,32 @@ int mlx5r_umr_update_mr_pas(struct mlx5_ib_mr *mr, unsigned int flags) err: sg.length = orig_sg_length; - mlx5r_umr_unmap_free_xlt(dev, mtt, &sg); + mlx5r_umr_unmap_free_xlt(dev, entry, &sg); return err; } +int mlx5r_umr_update_data_direct_ksm_pas(struct mlx5_ib_mr *mr, unsigned int flags) +{ + /* No invalidation flow is expected */ + if (WARN_ON(!mr->umem->is_dmabuf) || (flags & MLX5_IB_UPD_XLT_ZAP)) + return -EINVAL; + + return _mlx5r_umr_update_mr_pas(mr, flags, true); +} + +/* + * Send the DMA list to the HW for a normal MR using 
UMR. + * Dmabuf MR is handled in a similar way, except that the MLX5_IB_UPD_XLT_ZAP + * flag may be used. + */ +int mlx5r_umr_update_mr_pas(struct mlx5_ib_mr *mr, unsigned int flags) +{ + if (WARN_ON(mr->umem->is_odp)) + return -EINVAL; + + return _mlx5r_umr_update_mr_pas(mr, flags, false); +} + static bool umr_can_use_indirect_mkey(struct mlx5_ib_dev *dev) { return !MLX5_CAP_GEN(dev->mdev, umr_indirect_mkey_disabled); diff --git a/drivers/infiniband/hw/mlx5/umr.h b/drivers/infiniband/hw/mlx5/umr.h index 5f734dc72bef..4a02c9b5aad8 100644 --- a/drivers/infiniband/hw/mlx5/umr.h +++ b/drivers/infiniband/hw/mlx5/umr.h @@ -95,6 +95,7 @@ int mlx5r_umr_revoke_mr(struct mlx5_ib_mr *mr); int mlx5r_umr_rereg_pd_access(struct mlx5_ib_mr *mr, struct ib_pd *pd, int access_flags); int mlx5r_umr_update_mr_pas(struct mlx5_ib_mr *mr, unsigned int flags); +int mlx5r_umr_update_data_direct_ksm_pas(struct mlx5_ib_mr *mr, unsigned int flags); int mlx5r_umr_update_xlt(struct mlx5_ib_mr *mr, u64 idx, int npages, int page_shift, int flags); diff --git a/include/uapi/rdma/mlx5_user_ioctl_cmds.h b/include/uapi/rdma/mlx5_user_ioctl_cmds.h index 5b74d6534899..106276a4cce7 100644 --- a/include/uapi/rdma/mlx5_user_ioctl_cmds.h +++ b/include/uapi/rdma/mlx5_user_ioctl_cmds.h @@ -274,6 +274,10 @@ enum mlx5_ib_create_cq_attrs { MLX5_IB_ATTR_CREATE_CQ_UAR_INDEX = UVERBS_ID_DRIVER_NS_WITH_UHW, }; +enum mlx5_ib_reg_dmabuf_mr_attrs { + MLX5_IB_ATTR_REG_DMABUF_MR_ACCESS_FLAGS = (1U << UVERBS_ID_NS_SHIFT), +}; + #define MLX5_IB_DW_MATCH_PARAM 0xA0 struct mlx5_ib_match_params { diff --git a/include/uapi/rdma/mlx5_user_ioctl_verbs.h b/include/uapi/rdma/mlx5_user_ioctl_verbs.h index 3189c7f08d17..7c233df475e7 100644 --- a/include/uapi/rdma/mlx5_user_ioctl_verbs.h +++ b/include/uapi/rdma/mlx5_user_ioctl_verbs.h @@ -54,6 +54,10 @@ enum mlx5_ib_uapi_flow_action_packet_reformat_type { MLX5_IB_UAPI_FLOW_ACTION_PACKET_REFORMAT_TYPE_L2_TO_L3_TUNNEL = 0x3, }; +enum mlx5_ib_uapi_reg_dmabuf_flags { + MLX5_IB_UAPI_REG_DMABUF_ACCESS_DATA_DIRECT = 1 << 0, +}; + struct mlx5_ib_uapi_devx_async_cmd_hdr { __aligned_u64 wr_id; __u8 out_data[]; -- cgit v1.2.3 From ec7ad6530909983c8736c80af46e3529ce7bab55 Mon Sep 17 00:00:00 2001 From: Yishai Hadas Date: Thu, 1 Aug 2024 15:05:17 +0300 Subject: RDMA/mlx5: Introduce GET_DATA_DIRECT_SYSFS_PATH ioctl Introduce the 'GET_DATA_DIRECT_SYSFS_PATH' ioctl to return the sysfs path of the affiliated 'data direct' device for a given device. 
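As a usage sketch (the wrapper name below is an assumption about the rdma-core side; only the method and attribute come from this patch), a caller passes an output buffer and gets back the kobject path relative to /sys:

  char path[PATH_MAX];

  /* hypothetical helper wrapping MLX5_IB_METHOD_GET_DATA_DIRECT_SYSFS_PATH */
  if (!mlx5dv_get_data_direct_sysfs_path(ctx, path, sizeof(path)))
  	printf("data direct device: /sys%s\n", path);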
Signed-off-by: Yishai Hadas Link: https://patch.msgid.link/403745463e0ef52adbef681ff09aa6a29a756352.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky --- drivers/infiniband/hw/mlx5/std_types.c | 55 +++++++++++++++++++++++++++++++- include/uapi/rdma/mlx5_user_ioctl_cmds.h | 5 +++ 2 files changed, 59 insertions(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/drivers/infiniband/hw/mlx5/std_types.c b/drivers/infiniband/hw/mlx5/std_types.c index 5c83765a85e0..bdb568411091 100644 --- a/drivers/infiniband/hw/mlx5/std_types.c +++ b/drivers/infiniband/hw/mlx5/std_types.c @@ -10,6 +10,7 @@ #include #include #include "mlx5_ib.h" +#include "data_direct.h" #define UVERBS_MODULE_NAME mlx5_ib #include @@ -204,6 +205,50 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_QUERY_PORT)( sizeof(info)); } +static int UVERBS_HANDLER(MLX5_IB_METHOD_GET_DATA_DIRECT_SYSFS_PATH)( + struct uverbs_attr_bundle *attrs) +{ + struct mlx5_data_direct_dev *data_direct_dev; + struct mlx5_ib_ucontext *c; + struct mlx5_ib_dev *dev; + int out_len = uverbs_attr_get_len(attrs, + MLX5_IB_ATTR_GET_DATA_DIRECT_SYSFS_PATH); + u32 dev_path_len; + char *dev_path; + int ret; + + c = to_mucontext(ib_uverbs_get_ucontext(attrs)); + if (IS_ERR(c)) + return PTR_ERR(c); + dev = to_mdev(c->ibucontext.device); + mutex_lock(&dev->data_direct_lock); + data_direct_dev = dev->data_direct_dev; + if (!data_direct_dev) { + ret = -ENODEV; + goto end; + } + + dev_path = kobject_get_path(&data_direct_dev->device->kobj, GFP_KERNEL); + if (!dev_path) { + ret = -ENOMEM; + goto end; + } + + dev_path_len = strlen(dev_path) + 1; + if (dev_path_len > out_len) { + ret = -ENOSPC; + goto end; + } + + ret = uverbs_copy_to(attrs, MLX5_IB_ATTR_GET_DATA_DIRECT_SYSFS_PATH, dev_path, + dev_path_len); + kfree(dev_path); + +end: + mutex_unlock(&dev->data_direct_lock); + return ret; +} + DECLARE_UVERBS_NAMED_METHOD( MLX5_IB_METHOD_QUERY_PORT, UVERBS_ATTR_PTR_IN(MLX5_IB_ATTR_QUERY_PORT_PORT_NUM, @@ -214,9 +259,17 @@ DECLARE_UVERBS_NAMED_METHOD( reg_c0), UA_MANDATORY)); +DECLARE_UVERBS_NAMED_METHOD( + MLX5_IB_METHOD_GET_DATA_DIRECT_SYSFS_PATH, + UVERBS_ATTR_PTR_OUT( + MLX5_IB_ATTR_GET_DATA_DIRECT_SYSFS_PATH, + UVERBS_ATTR_MIN_SIZE(0), + UA_MANDATORY)); + ADD_UVERBS_METHODS(mlx5_ib_device, UVERBS_OBJECT_DEVICE, - &UVERBS_METHOD(MLX5_IB_METHOD_QUERY_PORT)); + &UVERBS_METHOD(MLX5_IB_METHOD_QUERY_PORT), + &UVERBS_METHOD(MLX5_IB_METHOD_GET_DATA_DIRECT_SYSFS_PATH)); DECLARE_UVERBS_NAMED_METHOD( MLX5_IB_METHOD_PD_QUERY, diff --git a/include/uapi/rdma/mlx5_user_ioctl_cmds.h b/include/uapi/rdma/mlx5_user_ioctl_cmds.h index 106276a4cce7..fd2e4a3a56b3 100644 --- a/include/uapi/rdma/mlx5_user_ioctl_cmds.h +++ b/include/uapi/rdma/mlx5_user_ioctl_cmds.h @@ -348,6 +348,7 @@ enum mlx5_ib_pd_methods { enum mlx5_ib_device_methods { MLX5_IB_METHOD_QUERY_PORT = (1U << UVERBS_ID_NS_SHIFT), + MLX5_IB_METHOD_GET_DATA_DIRECT_SYSFS_PATH, }; enum mlx5_ib_query_port_attrs { @@ -355,4 +356,8 @@ enum mlx5_ib_query_port_attrs { MLX5_IB_ATTR_QUERY_PORT, }; +enum mlx5_ib_get_data_direct_sysfs_path_attrs { + MLX5_IB_ATTR_GET_DATA_DIRECT_SYSFS_PATH = (1U << UVERBS_ID_NS_SHIFT), +}; + #endif -- cgit v1.2.3 From 0dc4fb69eb14320ea0fcd9657b7748eec201ccaa Mon Sep 17 00:00:00 2001 From: Mohammed Anees Date: Sun, 11 Aug 2024 06:16:51 -0400 Subject: drm: Add missing documentation for struct drm_plane_size_hint This patch takes care of the following warnings during documentation compiling: ./include/uapi/drm/drm_mode.h:869: warning: Function parameter or struct member 'width' not described in 
'drm_plane_size_hint' ./include/uapi/drm/drm_mode.h:869: warning: Function parameter or struct member 'height' not described in 'drm_plane_size_hint' Signed-off-by: Mohammed Anees Signed-off-by: Daniel Vetter Link: https://patchwork.freedesktop.org/patch/msgid/20240811101653.170223-1-pvmohammedanees2003@gmail.com --- include/uapi/drm/drm_mode.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/drm/drm_mode.h b/include/uapi/drm/drm_mode.h index d390011b89b4..c082810c08a8 100644 --- a/include/uapi/drm/drm_mode.h +++ b/include/uapi/drm/drm_mode.h @@ -859,6 +859,8 @@ struct drm_color_lut { /** * struct drm_plane_size_hint - Plane size hints + * @width: The width of the plane in pixel + * @height: The height of the plane in pixel * * The plane SIZE_HINTS property blob contains an * array of struct drm_plane_size_hint. -- cgit v1.2.3 From e9d05e9d5db155dcc708891611f1b6f977c3fa11 Mon Sep 17 00:00:00 2001 From: Jacopo Mondi Date: Thu, 8 Aug 2024 22:40:54 +0200 Subject: media: uapi: rkisp1-config: Add extensible params format Add to the rkisp1-config.h header data types and documentation of the extensible parameters format. Signed-off-by: Jacopo Mondi Reviewed-by: Laurent Pinchart Reviewed-by: Paul Elder Tested-by: Kieran Bingham Acked-by: Sakari Ailus Signed-off-by: Laurent Pinchart --- include/uapi/linux/rkisp1-config.h | 491 +++++++++++++++++++++++++++++++++++++ 1 file changed, 491 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/rkisp1-config.h b/include/uapi/linux/rkisp1-config.h index 6eeaf8bf2362..9767bb19c72d 100644 --- a/include/uapi/linux/rkisp1-config.h +++ b/include/uapi/linux/rkisp1-config.h @@ -996,4 +996,495 @@ struct rkisp1_stat_buffer { struct rkisp1_cif_isp_stat params; }; +/*---------- PART3: Extensible Configuration Parameters ------------*/ + +/** + * enum rkisp1_ext_params_block_type - RkISP1 extensible params block type + * + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_BLS: Black level subtraction + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_DPCC: Defect pixel cluster correction + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_SDG: Sensor de-gamma + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_AWB_GAIN: Auto white balance gains + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_FLT: ISP filtering + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_BDM: Bayer de-mosaic + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_CTK: Cross-talk correction + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_GOC: Gamma out correction + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_DPF: De-noise pre-filter + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_DPF_STRENGTH: De-noise pre-filter strength + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_CPROC: Color processing + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_IE: Image effects + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_LSC: Lens shading correction + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_AWB_MEAS: Auto white balance statistics + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_HST_MEAS: Histogram statistics + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_AEC_MEAS: Auto exposure statistics + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_AFC_MEAS: Auto-focus statistics + */ +enum rkisp1_ext_params_block_type { + RKISP1_EXT_PARAMS_BLOCK_TYPE_BLS, + RKISP1_EXT_PARAMS_BLOCK_TYPE_DPCC, + RKISP1_EXT_PARAMS_BLOCK_TYPE_SDG, + RKISP1_EXT_PARAMS_BLOCK_TYPE_AWB_GAIN, + RKISP1_EXT_PARAMS_BLOCK_TYPE_FLT, + RKISP1_EXT_PARAMS_BLOCK_TYPE_BDM, + RKISP1_EXT_PARAMS_BLOCK_TYPE_CTK, + RKISP1_EXT_PARAMS_BLOCK_TYPE_GOC, + RKISP1_EXT_PARAMS_BLOCK_TYPE_DPF, + RKISP1_EXT_PARAMS_BLOCK_TYPE_DPF_STRENGTH, + RKISP1_EXT_PARAMS_BLOCK_TYPE_CPROC, + RKISP1_EXT_PARAMS_BLOCK_TYPE_IE, + RKISP1_EXT_PARAMS_BLOCK_TYPE_LSC, + 
RKISP1_EXT_PARAMS_BLOCK_TYPE_AWB_MEAS, + RKISP1_EXT_PARAMS_BLOCK_TYPE_HST_MEAS, + RKISP1_EXT_PARAMS_BLOCK_TYPE_AEC_MEAS, + RKISP1_EXT_PARAMS_BLOCK_TYPE_AFC_MEAS, +}; + +#define RKISP1_EXT_PARAMS_FL_BLOCK_DISABLE (1U << 0) +#define RKISP1_EXT_PARAMS_FL_BLOCK_ENABLE (1U << 1) + +/** + * struct rkisp1_ext_params_block_header - RkISP1 extensible parameters block + * header + * + * This structure represents the common part of all the ISP configuration + * blocks. Each parameters block shall embed an instance of this structure type + * as its first member, followed by the block-specific configuration data. The + * driver inspects this common header to discern the block type and its size and + * properly handle the block content by casting it to the correct block-specific + * type. + * + * The @type field is one of the values enumerated by + * :c:type:`rkisp1_ext_params_block_type` and specifies how the data should be + * interpreted by the driver. The @size field specifies the size of the + * parameters block and is used by the driver for validation purposes. + * + * The @flags field is a bitmask of per-block flags RKISP1_EXT_PARAMS_FL_*. + * + * When userspace wants to configure and enable an ISP block it shall fully + * populate the block configuration and set the + * RKISP1_EXT_PARAMS_FL_BLOCK_ENABLE bit in the @flags field. + * + * When userspace simply wants to disable an ISP block the + * RKISP1_EXT_PARAMS_FL_BLOCK_DISABLE bit should be set in @flags field. The + * driver ignores the rest of the block configuration structure in this case. + * + * If a new configuration of an ISP block has to be applied userspace shall + * fully populate the ISP block configuration and omit setting the + * RKISP1_EXT_PARAMS_FL_BLOCK_ENABLE and RKISP1_EXT_PARAMS_FL_BLOCK_DISABLE bits + * in the @flags field. + * + * Setting both the RKISP1_EXT_PARAMS_FL_BLOCK_ENABLE and + * RKISP1_EXT_PARAMS_FL_BLOCK_DISABLE bits in the @flags field is not allowed + * and not accepted by the driver. + * + * Userspace is responsible for correctly populating the parameters block header + * fields (@type, @flags and @size) and the block-specific parameters. + * + * For example: + * + * .. code-block:: c + * + * void populate_bls(struct rkisp1_ext_params_block_header *block) { + * struct rkisp1_ext_params_bls_config *bls = + * (struct rkisp1_ext_params_bls_config *)block; + * + * bls->header.type = RKISP1_EXT_PARAMS_BLOCK_TYPE_BLS; + * bls->header.flags = RKISP1_EXT_PARAMS_FL_BLOCK_ENABLE; + * bls->header.size = sizeof(*bls); + * + * bls->config.enable_auto = 0; + * bls->config.fixed_val.r = blackLevelRed_; + * bls->config.fixed_val.gr = blackLevelGreenR_; + * bls->config.fixed_val.gb = blackLevelGreenB_; + * bls->config.fixed_val.b = blackLevelBlue_; + * } + * + * @type: The parameters block type, see + * :c:type:`rkisp1_ext_params_block_type` + * @flags: A bitmask of block flags + * @size: Size (in bytes) of the parameters block, including this header + */ +struct rkisp1_ext_params_block_header { + __u16 type; + __u16 flags; + __u32 size; +}; + +/** + * struct rkisp1_ext_params_bls_config - RkISP1 extensible params BLS config + * + * RkISP1 extensible parameters Black Level Subtraction configuration block. + * Identified by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_BLS`. 
+ * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Black Level Subtraction configuration, see + * :c:type:`rkisp1_cif_isp_bls_config` + */ +struct rkisp1_ext_params_bls_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_bls_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_dpcc_config - RkISP1 extensible params DPCC config + * + * RkISP1 extensible parameters Defective Pixel Cluster Correction configuration + * block. Identified by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_DPCC`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Defective Pixel Cluster Correction configuration, see + * :c:type:`rkisp1_cif_isp_dpcc_config` + */ +struct rkisp1_ext_params_dpcc_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_dpcc_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_sdg_config - RkISP1 extensible params SDG config + * + * RkISP1 extensible parameters Sensor Degamma configuration block. Identified + * by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_SDG`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Sensor Degamma configuration, see + * :c:type:`rkisp1_cif_isp_sdg_config` + */ +struct rkisp1_ext_params_sdg_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_sdg_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_lsc_config - RkISP1 extensible params LSC config + * + * RkISP1 extensible parameters Lens Shading Correction configuration block. + * Identified by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_LSC`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Lens Shading Correction configuration, see + * :c:type:`rkisp1_cif_isp_lsc_config` + */ +struct rkisp1_ext_params_lsc_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_lsc_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_awb_gain_config - RkISP1 extensible params AWB + * gain config + * + * RkISP1 extensible parameters Auto-White Balance Gains configuration block. + * Identified by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_AWB_GAIN`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Auto-White Balance Gains configuration, see + * :c:type:`rkisp1_cif_isp_awb_gain_config` + */ +struct rkisp1_ext_params_awb_gain_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_awb_gain_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_flt_config - RkISP1 extensible params FLT config + * + * RkISP1 extensible parameters Filter configuration block. Identified by + * :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_FLT`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Filter configuration, see :c:type:`rkisp1_cif_isp_flt_config` + */ +struct rkisp1_ext_params_flt_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_flt_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_bdm_config - RkISP1 extensible params BDM config + * + * RkISP1 extensible parameters Demosaicing configuration block. 
Identified by + * :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_BDM`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Demosaicing configuration, see :c:type:`rkisp1_cif_isp_bdm_config` + */ +struct rkisp1_ext_params_bdm_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_bdm_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_ctk_config - RkISP1 extensible params CTK config + * + * RkISP1 extensible parameters Cross-Talk configuration block. Identified by + * :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_CTK`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Cross-Talk configuration, see :c:type:`rkisp1_cif_isp_ctk_config` + */ +struct rkisp1_ext_params_ctk_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_ctk_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_goc_config - RkISP1 extensible params GOC config + * + * RkISP1 extensible parameters Gamma-Out configuration block. Identified by + * :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_GOC`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Gamma-Out configuration, see :c:type:`rkisp1_cif_isp_goc_config` + */ +struct rkisp1_ext_params_goc_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_goc_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_dpf_config - RkISP1 extensible params DPF config + * + * RkISP1 extensible parameters De-noise Pre-Filter configuration block. + * Identified by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_DPF`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: De-noise Pre-Filter configuration, see + * :c:type:`rkisp1_cif_isp_dpf_config` + */ +struct rkisp1_ext_params_dpf_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_dpf_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_dpf_strength_config - RkISP1 extensible params DPF + * strength config + * + * RkISP1 extensible parameters De-noise Pre-Filter strength configuration + * block. Identified by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_DPF_STRENGTH`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: De-noise Pre-Filter strength configuration, see + * :c:type:`rkisp1_cif_isp_dpf_strength_config` + */ +struct rkisp1_ext_params_dpf_strength_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_dpf_strength_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_cproc_config - RkISP1 extensible params CPROC config + * + * RkISP1 extensible parameters Color Processing configuration block. + * Identified by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_CPROC`. 
+ * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Color processing configuration, see + * :c:type:`rkisp1_cif_isp_cproc_config` + */ +struct rkisp1_ext_params_cproc_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_cproc_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_ie_config - RkISP1 extensible params IE config + * + * RkISP1 extensible parameters Image Effect configuration block. Identified by + * :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_IE`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Image Effect configuration, see :c:type:`rkisp1_cif_isp_ie_config` + */ +struct rkisp1_ext_params_ie_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_ie_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_awb_meas_config - RkISP1 extensible params AWB + * Meas config + * + * RkISP1 extensible parameters Auto-White Balance Measurement configuration + * block. Identified by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_AWB_MEAS`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Auto-White Balance measure configuration, see + * :c:type:`rkisp1_cif_isp_awb_meas_config` + */ +struct rkisp1_ext_params_awb_meas_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_awb_meas_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_hst_config - RkISP1 extensible params Histogram config + * + * RkISP1 extensible parameters Histogram statistics configuration block. + * Identified by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_HST_MEAS`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Histogram statistics configuration, see + * :c:type:`rkisp1_cif_isp_hst_config` + */ +struct rkisp1_ext_params_hst_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_hst_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_aec_config - RkISP1 extensible params AEC config + * + * RkISP1 extensible parameters Auto-Exposure statistics configuration block. + * Identified by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_AEC_MEAS`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Auto-Exposure statistics configuration, see + * :c:type:`rkisp1_cif_isp_aec_config` + */ +struct rkisp1_ext_params_aec_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_aec_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_afc_config - RkISP1 extensible params AFC config + * + * RkISP1 extensible parameters Auto-Focus statistics configuration block. + * Identified by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_AFC_MEAS`. 
+ *
+ * @header: The RkISP1 extensible parameters header, see
+ * :c:type:`rkisp1_ext_params_block_header`
+ * @config: Auto-Focus statistics configuration, see
+ * :c:type:`rkisp1_cif_isp_afc_config`
+ */
+struct rkisp1_ext_params_afc_config {
+ struct rkisp1_ext_params_block_header header;
+ struct rkisp1_cif_isp_afc_config config;
+} __attribute__((aligned(8)));
+
+#define RKISP1_EXT_PARAMS_MAX_SIZE \
+ (sizeof(struct rkisp1_ext_params_bls_config) +\
+ sizeof(struct rkisp1_ext_params_dpcc_config) +\
+ sizeof(struct rkisp1_ext_params_sdg_config) +\
+ sizeof(struct rkisp1_ext_params_lsc_config) +\
+ sizeof(struct rkisp1_ext_params_awb_gain_config) +\
+ sizeof(struct rkisp1_ext_params_flt_config) +\
+ sizeof(struct rkisp1_ext_params_bdm_config) +\
+ sizeof(struct rkisp1_ext_params_ctk_config) +\
+ sizeof(struct rkisp1_ext_params_goc_config) +\
+ sizeof(struct rkisp1_ext_params_dpf_config) +\
+ sizeof(struct rkisp1_ext_params_dpf_strength_config) +\
+ sizeof(struct rkisp1_ext_params_cproc_config) +\
+ sizeof(struct rkisp1_ext_params_ie_config) +\
+ sizeof(struct rkisp1_ext_params_awb_meas_config) +\
+ sizeof(struct rkisp1_ext_params_hst_config) +\
+ sizeof(struct rkisp1_ext_params_aec_config) +\
+ sizeof(struct rkisp1_ext_params_afc_config))
+
+/**
+ * enum rksip1_ext_param_buffer_version - RkISP1 extensible parameters version
+ *
+ * @RKISP1_EXT_PARAM_BUFFER_V1: First version of RkISP1 extensible parameters
+ */
+enum rksip1_ext_param_buffer_version {
+ RKISP1_EXT_PARAM_BUFFER_V1 = 1,
+};
+
+/**
+ * struct rkisp1_ext_params_cfg - RkISP1 extensible parameters configuration
+ *
+ * This struct contains the configuration parameters of the RkISP1 ISP
+ * algorithms, serialized by userspace into a data buffer. Each configuration
+ * parameter block is represented by a block-specific structure which contains a
+ * :c:type:`rkisp1_ext_params_block_header` entry as its first member. Userspace
+ * populates the @data buffer with configuration parameters for the blocks that
+ * it intends to configure. As a consequence, the data buffer effective size
+ * changes according to the number of ISP blocks that userspace intends to
+ * configure and is set by userspace in the @data_size field.
+ *
+ * The parameters buffer is versioned by the @version field to allow modifying
+ * and extending its definition. Userspace shall populate the @version field to
+ * inform the driver about the version it intends to use. The driver will parse
+ * and handle the @data buffer according to the data layout specific to the
+ * indicated version and return an error if the desired version is not
+ * supported.
+ *
+ * Currently the single RKISP1_EXT_PARAM_BUFFER_V1 version is supported.
+ * When a new format version is added, a mechanism for userspace to query
+ * the supported format versions will be implemented in the form of a read-only
+ * V4L2 control. If such a control is not available, userspace should assume only
+ * RKISP1_EXT_PARAM_BUFFER_V1 is supported by the driver.
+ *
+ * For each ISP block that userspace wants to configure, a block-specific
+ * structure is appended to the @data buffer, one after the other, without gaps
+ * in between or overlaps. Userspace shall populate the @data_size field with
+ * the effective size, in bytes, of the @data buffer.
+ *
+ * The expected memory layout of the parameters buffer is::
+ *
+ * +-------------------- struct rkisp1_ext_params_cfg -------------------+
+ * | version = RKISP1_EXT_PARAM_BUFFER_V1; |
+ * | data_size = sizeof(struct rkisp1_ext_params_bls_config) |
+ * | + sizeof(struct rkisp1_ext_params_dpcc_config); |
+ * | +------------------------- data ---------------------------------+ |
+ * | | +------------- struct rkisp1_ext_params_bls_config -----------+ | |
+ * | | | +-------- struct rkisp1_ext_params_block_header ---------+ | | |
+ * | | | | type = RKISP1_EXT_PARAMS_BLOCK_TYPE_BLS; | | | |
+ * | | | | flags = RKISP1_EXT_PARAMS_FL_BLOCK_ENABLE; | | | |
+ * | | | | size = sizeof(struct rkisp1_ext_params_bls_config); | | | |
+ * | | | +---------------------------------------------------------+ | | |
+ * | | | +---------- struct rkisp1_cif_isp_bls_config -------------+ | | |
+ * | | | | enable_auto = 0; | | | |
+ * | | | | fixed_val.r = 256; | | | |
+ * | | | | fixed_val.gr = 256; | | | |
+ * | | | | fixed_val.gb = 256; | | | |
+ * | | | | fixed_val.b = 256; | | | |
+ * | | | +---------------------------------------------------------+ | | |
+ * | | +------------ struct rkisp1_ext_params_dpcc_config -----------+ | |
+ * | | | +-------- struct rkisp1_ext_params_block_header ---------+ | | |
+ * | | | | type = RKISP1_EXT_PARAMS_BLOCK_TYPE_DPCC; | | | |
+ * | | | | flags = RKISP1_EXT_PARAMS_FL_BLOCK_ENABLE; | | | |
+ * | | | | size = sizeof(struct rkisp1_ext_params_dpcc_config); | | | |
+ * | | | +---------------------------------------------------------+ | | |
+ * | | | +---------- struct rkisp1_cif_isp_dpcc_config ------------+ | | |
+ * | | | | mode = RKISP1_CIF_ISP_DPCC_MODE_STAGE1_ENABLE; | | | |
+ * | | | | output_mode = | | | |
+ * | | | | RKISP1_CIF_ISP_DPCC_OUTPUT_MODE_STAGE1_INCL_G_CENTER; | | | |
+ * | | | | set_use = ... ; | | | |
+ * | | | | ... = ... ; | | | |
+ * | | | +---------------------------------------------------------+ | | |
+ * | | +-------------------------------------------------------------+ | |
+ * | +-----------------------------------------------------------------+ |
+ * +---------------------------------------------------------------------+
+ *
+ * @version: The RkISP1 extensible parameters buffer version, see
+ * :c:type:`rksip1_ext_param_buffer_version`
+ * @data_size: The RkISP1 configuration data effective size, excluding this
+ * header
+ * @data: The RkISP1 extensible configuration data blocks
+ */
+struct rkisp1_ext_params_cfg {
+ __u32 version;
+ __u32 data_size;
+ __u8 data[RKISP1_EXT_PARAMS_MAX_SIZE];
+};
+
 #endif /* _UAPI_RKISP1_CONFIG_H */
-- cgit v1.2.3

From 1fc379f6241b331207c4573c4ff43526fe404301 Mon Sep 17 00:00:00 2001
From: Jacopo Mondi
Date: Thu, 8 Aug 2024 22:40:55 +0200
Subject: media: uapi: videodev2: Add V4L2_META_FMT_RK_ISP1_EXT_PARAMS

The rkisp1 driver stores ISP configuration parameters in the fixed rkisp1_params_cfg structure. As the members of the structure are part of the userspace API, the structure layout is immutable and cannot be extended further. Introducing new parameters or modifying the existing ones would change the buffer layout and cause breakages in existing applications.

To allow for future extensions to the ISP parameters, introduce a new extensible parameters format, with a new format 4CC. Document usage of the new format in the rkisp1 admin guide.
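For illustration only (this sketch is not part of the patch), userspace could populate an extensible parameters buffer with a single BLS block as follows, assuming the uAPI definitions added in the previous patch; V4L2 buffer plumbing and error handling are omitted:

	#include <string.h>
	#include <linux/rkisp1-config.h>

	/* Enable BLS with fixed black levels of 256 per Bayer channel,
	 * mirroring the layout diagram in the header documentation.
	 */
	static void fill_ext_params(struct rkisp1_ext_params_cfg *cfg)
	{
		struct rkisp1_ext_params_bls_config bls = {
			.header = {
				.type = RKISP1_EXT_PARAMS_BLOCK_TYPE_BLS,
				.flags = RKISP1_EXT_PARAMS_FL_BLOCK_ENABLE,
				.size = sizeof(bls),
			},
			.config = {
				.enable_auto = 0,
				.fixed_val = {
					.r = 256, .gr = 256, .gb = 256, .b = 256,
				},
			},
		};

		cfg->version = RKISP1_EXT_PARAM_BUFFER_V1;
		memcpy(cfg->data, &bls, sizeof(bls));
		cfg->data_size = sizeof(bls); /* excludes version and data_size */
	}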
Signed-off-by: Jacopo Mondi
Reviewed-by: Daniel Scally
Reviewed-by: Paul Elder
Reviewed-by: Laurent Pinchart
Tested-by: Kieran Bingham
Acked-by: Sakari Ailus
Signed-off-by: Laurent Pinchart
---
Documentation/admin-guide/media/rkisp1.rst | 11 ++++-
.../userspace-api/media/v4l/metafmt-rkisp1.rst | 57 ++++++++++++++++++----
drivers/media/v4l2-core/v4l2-ioctl.c | 1 +
include/uapi/linux/videodev2.h | 1 +
4 files changed, 59 insertions(+), 11 deletions(-)
(limited to 'include/uapi')

diff --git a/Documentation/admin-guide/media/rkisp1.rst b/Documentation/admin-guide/media/rkisp1.rst
index 6f14d9561fa5..6c878c71442f 100644
--- a/Documentation/admin-guide/media/rkisp1.rst
+++ b/Documentation/admin-guide/media/rkisp1.rst
@@ -114,11 +114,18 @@ to be applied to the hardware during a video stream, allowing userspace to dynamically modify values such as black level, cross talk corrections and others.

-The buffer format is defined by struct :c:type:`rkisp1_params_cfg`, and
-userspace should set
+The ISP driver supports two different parameters configuration methods: the
+`fixed parameters format` and the `extensible parameters format`.
+
+When using the `fixed parameters` method, the buffer format is defined by struct
+:c:type:`rkisp1_params_cfg`, and userspace should set
 :ref:`V4L2_META_FMT_RK_ISP1_PARAMS ` as the dataformat.

+When using the `extensible parameters` method, the buffer format is defined by
+struct :c:type:`rkisp1_ext_params_cfg`, and userspace should set
+:ref:`V4L2_META_FMT_RK_ISP1_EXT_PARAMS ` as
+the dataformat.

 Capturing Video Frames Example
 ==============================

diff --git a/Documentation/userspace-api/media/v4l/metafmt-rkisp1.rst b/Documentation/userspace-api/media/v4l/metafmt-rkisp1.rst
index fa04f00bcd2e..959f6bde8695 100644
--- a/Documentation/userspace-api/media/v4l/metafmt-rkisp1.rst
+++ b/Documentation/userspace-api/media/v4l/metafmt-rkisp1.rst
@@ -1,28 +1,67 @@
 .. SPDX-License-Identifier: GPL-2.0

-.. _v4l2-meta-fmt-rk-isp1-params:
-
 .. _v4l2-meta-fmt-rk-isp1-stat-3a:

-*****************************************************************************
V4L2_META_FMT_RK_ISP1_PARAMS ('rk1p'), V4L2_META_FMT_RK_ISP1_STAT_3A ('rk1s')
-*****************************************************************************
+************************************************************************************************************************
+V4L2_META_FMT_RK_ISP1_PARAMS ('rk1p'), V4L2_META_FMT_RK_ISP1_STAT_3A ('rk1s'), V4L2_META_FMT_RK_ISP1_EXT_PARAMS ('rk1e')
+************************************************************************************************************************

+========================
 Configuration parameters
 ========================

-The configuration parameters are passed to the
+The configuration of the RkISP1 ISP is performed by userspace, which provides
+the ISP parameters to the driver using the :c:type:`v4l2_meta_format`
+interface.
+
+There are two methods to configure the ISP: the `fixed parameters`
+configuration format and the `extensible parameters` configuration
+format.
+
+.. _v4l2-meta-fmt-rk-isp1-params:
+
+Fixed parameters configuration format
+=====================================
+
+When using the fixed configuration format, parameters are passed to the
 :ref:`rkisp1_params ` metadata output video node, using
-the :c:type:`v4l2_meta_format` interface. The buffer contains
-a single instance of the C structure :c:type:`rkisp1_params_cfg` defined in
-``rkisp1-config.h``.
So the structure can be obtained from the buffer by:
+the `V4L2_META_FMT_RK_ISP1_PARAMS` meta format.
+
+The buffer contains a single instance of the C structure
+:c:type:`rkisp1_params_cfg` defined in ``rkisp1-config.h``. So the structure can
+be obtained from the buffer by:

 .. code-block:: c

 struct rkisp1_params_cfg *params = (struct rkisp1_params_cfg*) buffer;

+This method supports only a subset of the ISP features; new applications should
+use the extensible parameters method.
+
+.. _v4l2-meta-fmt-rk-isp1-ext-params:
+
+Extensible parameters configuration format
+==========================================
+
+When using the extensible configuration format, parameters are passed to the
+:ref:`rkisp1_params ` metadata output video node, using
+the `V4L2_META_FMT_RK_ISP1_EXT_PARAMS` meta format.
+
+The buffer contains a single instance of the C structure
+:c:type:`rkisp1_ext_params_cfg` defined in ``rkisp1-config.h``. The
+:c:type:`rkisp1_ext_params_cfg` structure is designed to allow userspace to
+populate the data buffer with only the configuration data for the ISP blocks it
+intends to configure. The extensible parameters format design allows developers
+to define new block types to support new configuration parameters, and defines a
+versioning scheme so that the format can be extended without breaking
+compatibility with existing applications.
+
+For these reasons, this configuration method is preferred over the `fixed
+parameters` format alternative.
+
 .. rkisp1_stat_buffer

+===========================
 3A and histogram statistics
 ===========================

diff --git a/drivers/media/v4l2-core/v4l2-ioctl.c b/drivers/media/v4l2-core/v4l2-ioctl.c
index 64c3e79d6378..e14db67be97c 100644
--- a/drivers/media/v4l2-core/v4l2-ioctl.c
+++ b/drivers/media/v4l2-core/v4l2-ioctl.c
@@ -1458,6 +1458,7 @@ static void v4l_fill_fmtdesc(struct v4l2_fmtdesc *fmt)
 case V4L2_META_FMT_VIVID: descr = "Vivid Metadata"; break;
 case V4L2_META_FMT_RK_ISP1_PARAMS: descr = "Rockchip ISP1 3A Parameters"; break;
 case V4L2_META_FMT_RK_ISP1_STAT_3A: descr = "Rockchip ISP1 3A Statistics"; break;
+ case V4L2_META_FMT_RK_ISP1_EXT_PARAMS: descr = "Rockchip ISP1 Ext 3A Params"; break;
 case V4L2_PIX_FMT_NV12_8L128: descr = "NV12 (8x128 Linear)"; break;
 case V4L2_PIX_FMT_NV12M_8L128: descr = "NV12M (8x128 Linear)"; break;
 case V4L2_PIX_FMT_NV12_10BE_8L128: descr = "10-bit NV12 (8x128 Linear, BE)"; break;

diff --git a/include/uapi/linux/videodev2.h b/include/uapi/linux/videodev2.h
index 4e91362da6da..725e86c4bbbd 100644
--- a/include/uapi/linux/videodev2.h
+++ b/include/uapi/linux/videodev2.h
@@ -854,6 +854,7 @@ struct v4l2_pix_format {
 /* Vendor specific - used for RK_ISP1 camera sub-system */
 #define V4L2_META_FMT_RK_ISP1_PARAMS v4l2_fourcc('R', 'K', '1', 'P') /* Rockchip ISP1 3A Parameters */
 #define V4L2_META_FMT_RK_ISP1_STAT_3A v4l2_fourcc('R', 'K', '1', 'S') /* Rockchip ISP1 3A Statistics */
+#define V4L2_META_FMT_RK_ISP1_EXT_PARAMS v4l2_fourcc('R', 'K', '1', 'E') /* Rockchip ISP1 3A Extensible Parameters */
 /* Vendor specific - used for RaspberryPi PiSP */
 #define V4L2_META_FMT_RPI_BE_CFG v4l2_fourcc('R', 'P', 'B', 'C') /* PiSP BE configuration */
-- cgit v1.2.3

From 3d50c66c0609c8b64fb22e9c188fca88f34e7c98 Mon Sep 17 00:00:00 2001
From: Jakub Kicinski
Date: Fri, 9 Aug 2024 22:37:26 -0700
Subject: ethtool: rss: support skipping contexts during dump

Applications may want to deal with dynamic RSS contexts only, so dumping context 0 will be counter-productive for them.
Support starting the dump from a given context ID. An alternative would be to implement a dump flag to skip just context 0, not sure which is better...

Reviewed-by: Edward Cree
Signed-off-by: Jakub Kicinski
Signed-off-by: David S. Miller
---
Documentation/netlink/specs/ethtool.yaml | 4 ++++
Documentation/networking/ethtool-netlink.rst | 12 ++++++++++--
include/uapi/linux/ethtool_netlink.h | 1 +
net/ethtool/netlink.h | 2 +-
net/ethtool/rss.c | 12 +++++++++++-
5 files changed, 27 insertions(+), 4 deletions(-)
(limited to 'include/uapi')

diff --git a/Documentation/netlink/specs/ethtool.yaml b/Documentation/netlink/specs/ethtool.yaml
index cf69eedae51d..4c2334c213b0 100644
--- a/Documentation/netlink/specs/ethtool.yaml
+++ b/Documentation/netlink/specs/ethtool.yaml
@@ -1028,6 +1028,9 @@ attribute-sets:
 -
 name: input_xfrm
 type: u32
+ -
+ name: start-context
+ type: u32
 -
 name: plca
 attributes:
@@ -1766,6 +1769,7 @@ operations:
 request:
 attributes:
 - header
+ - start-context
 reply: *rss-reply
 -
 name: plca-get-cfg

diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index d5f246aceb9f..82c5542c80ce 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -1866,10 +1866,18 @@ RSS context of an interface similar to ``ETHTOOL_GRSSH`` ioctl request.

Request contents:

-===================================== ====== ==========================
+===================================== ====== ============================
 ``ETHTOOL_A_RSS_HEADER`` nested request header
 ``ETHTOOL_A_RSS_CONTEXT`` u32 context number
-===================================== ====== ==========================
+ ``ETHTOOL_A_RSS_START_CONTEXT`` u32 start context number (dumps)
+===================================== ====== ============================
+
+``ETHTOOL_A_RSS_CONTEXT`` specifies which RSS context number to query;
+if not set, context 0 (the main context) is queried. Dumps can be filtered
+by device (only listing contexts of a given netdev). Filtering on a single
+context number is not supported, but ``ETHTOOL_A_RSS_START_CONTEXT``
+can be used to start dumping contexts from the given number (primarily
+used to ignore context 0s and dump only the additional contexts).
Kernel response contents: diff --git a/include/uapi/linux/ethtool_netlink.h b/include/uapi/linux/ethtool_netlink.h index 6d5bdcc67631..93c57525a975 100644 --- a/include/uapi/linux/ethtool_netlink.h +++ b/include/uapi/linux/ethtool_netlink.h @@ -965,6 +965,7 @@ enum { ETHTOOL_A_RSS_INDIR, /* binary */ ETHTOOL_A_RSS_HKEY, /* binary */ ETHTOOL_A_RSS_INPUT_XFRM, /* u32 */ + ETHTOOL_A_RSS_START_CONTEXT, /* u32 */ __ETHTOOL_A_RSS_CNT, ETHTOOL_A_RSS_MAX = (__ETHTOOL_A_RSS_CNT - 1), diff --git a/net/ethtool/netlink.h b/net/ethtool/netlink.h index 919371383b23..236c189fc968 100644 --- a/net/ethtool/netlink.h +++ b/net/ethtool/netlink.h @@ -449,7 +449,7 @@ extern const struct nla_policy ethnl_module_get_policy[ETHTOOL_A_MODULE_HEADER + extern const struct nla_policy ethnl_module_set_policy[ETHTOOL_A_MODULE_POWER_MODE_POLICY + 1]; extern const struct nla_policy ethnl_pse_get_policy[ETHTOOL_A_PSE_HEADER + 1]; extern const struct nla_policy ethnl_pse_set_policy[ETHTOOL_A_PSE_MAX + 1]; -extern const struct nla_policy ethnl_rss_get_policy[ETHTOOL_A_RSS_CONTEXT + 1]; +extern const struct nla_policy ethnl_rss_get_policy[ETHTOOL_A_RSS_START_CONTEXT + 1]; extern const struct nla_policy ethnl_plca_get_cfg_policy[ETHTOOL_A_PLCA_HEADER + 1]; extern const struct nla_policy ethnl_plca_set_cfg_policy[ETHTOOL_A_PLCA_MAX + 1]; extern const struct nla_policy ethnl_plca_get_status_policy[ETHTOOL_A_PLCA_HEADER + 1]; diff --git a/net/ethtool/rss.c b/net/ethtool/rss.c index 2865720ff583..e07386275e14 100644 --- a/net/ethtool/rss.c +++ b/net/ethtool/rss.c @@ -28,6 +28,7 @@ struct rss_reply_data { const struct nla_policy ethnl_rss_get_policy[] = { [ETHTOOL_A_RSS_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy), [ETHTOOL_A_RSS_CONTEXT] = { .type = NLA_U32 }, + [ETHTOOL_A_RSS_START_CONTEXT] = { .type = NLA_U32 }, }; static int @@ -38,6 +39,10 @@ rss_parse_request(struct ethnl_req_info *req_info, struct nlattr **tb, if (tb[ETHTOOL_A_RSS_CONTEXT]) request->rss_context = nla_get_u32(tb[ETHTOOL_A_RSS_CONTEXT]); + if (tb[ETHTOOL_A_RSS_START_CONTEXT]) { + NL_SET_BAD_ATTR(extack, tb[ETHTOOL_A_RSS_START_CONTEXT]); + return -EINVAL; + } return 0; } @@ -214,6 +219,7 @@ struct rss_nl_dump_ctx { /* User wants to only dump contexts from given ifindex */ unsigned int match_ifindex; + unsigned int start_ctx; }; static struct rss_nl_dump_ctx *rss_dump_ctx(struct netlink_callback *cb) @@ -236,6 +242,10 @@ int ethnl_rss_dump_start(struct netlink_callback *cb) NL_SET_BAD_ATTR(info->extack, tb[ETHTOOL_A_RSS_CONTEXT]); return -EINVAL; } + if (tb[ETHTOOL_A_RSS_START_CONTEXT]) { + ctx->start_ctx = nla_get_u32(tb[ETHTOOL_A_RSS_START_CONTEXT]); + ctx->ctx_idx = ctx->start_ctx; + } ret = ethnl_parse_header_dev_get(&req_info, tb[ETHTOOL_A_RSS_HEADER], @@ -317,7 +327,7 @@ rss_dump_one_dev(struct sk_buff *skb, struct netlink_callback *cb, if (ret) return ret; } - ctx->ctx_idx = 0; + ctx->ctx_idx = ctx->start_ctx; return 0; } -- cgit v1.2.3 From 75bab45e6b2da379fe2ebda48ed35f8ce371a2ef Mon Sep 17 00:00:00 2001 From: Petr Machata Date: Wed, 7 Aug 2024 16:13:46 +0200 Subject: net: nexthop: Add flag to assert that NHGRP reserved fields are zero There are many unpatched kernel versions out there that do not initialize the reserved fields of struct nexthop_grp. The issue with that is that if those fields were to be used for some end (i.e. stop being reserved), old kernels would still keep sending random data through the field, and a new userspace could not rely on the value. 
In this patch, use the existing NHA_OP_FLAGS, which is currently inbound only, to carry flags back to the userspace. Add a flag to indicate that the reserved fields in struct nexthop_grp are zeroed before dumping. This is reliant on the actual fix from commit 6d745cd0e972 ("net: nexthop: Initialize all fields in dumped nexthops"). Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel Link: https://patch.msgid.link/21037748d4f9d8ff486151f4c09083bcf12d5df8.1723036486.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski --- include/uapi/linux/nexthop.h | 3 +++ net/ipv4/nexthop.c | 12 +++++++++--- 2 files changed, 12 insertions(+), 3 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h index dd8787f9cf39..f4f060a87cc2 100644 --- a/include/uapi/linux/nexthop.h +++ b/include/uapi/linux/nexthop.h @@ -33,6 +33,9 @@ enum { #define NHA_OP_FLAG_DUMP_STATS BIT(0) #define NHA_OP_FLAG_DUMP_HW_STATS BIT(1) +/* Response OP_FLAGS. */ +#define NHA_OP_FLAG_RESP_GRP_RESVD_0 BIT(31) /* Dump clears resvd fields. */ + enum { NHA_UNSPEC, NHA_ID, /* u32; id for nexthop. id == 0 means auto-assign */ diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index 6b9787ee8601..23caa13bf24d 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -865,7 +865,7 @@ out: } static int nla_put_nh_group(struct sk_buff *skb, struct nexthop *nh, - u32 op_flags) + u32 op_flags, u32 *resp_op_flags) { struct nh_group *nhg = rtnl_dereference(nh->nh_grp); struct nexthop_grp *p; @@ -874,6 +874,8 @@ static int nla_put_nh_group(struct sk_buff *skb, struct nexthop *nh, u16 group_type = 0; int i; + *resp_op_flags |= NHA_OP_FLAG_RESP_GRP_RESVD_0; + if (nhg->hash_threshold) group_type = NEXTHOP_GRP_TYPE_MPATH; else if (nhg->resilient) @@ -934,10 +936,12 @@ static int nh_fill_node(struct sk_buff *skb, struct nexthop *nh, if (nh->is_group) { struct nh_group *nhg = rtnl_dereference(nh->nh_grp); + u32 resp_op_flags = 0; if (nhg->fdb_nh && nla_put_flag(skb, NHA_FDB)) goto nla_put_failure; - if (nla_put_nh_group(skb, nh, op_flags)) + if (nla_put_nh_group(skb, nh, op_flags, &resp_op_flags) || + nla_put_u32(skb, NHA_OP_FLAGS, resp_op_flags)) goto nla_put_failure; goto out; } @@ -1050,7 +1054,9 @@ static size_t nh_nlmsg_size(struct nexthop *nh) sz += nla_total_size(4); /* NHA_ID */ if (nh->is_group) - sz += nh_nlmsg_size_grp(nh); + sz += nh_nlmsg_size_grp(nh) + + nla_total_size(4) + /* NHA_OP_FLAGS */ + 0; else sz += nh_nlmsg_size_single(nh); -- cgit v1.2.3 From b72a6a7ab9573e06d5c2fcb92eaa28614a735bfd Mon Sep 17 00:00:00 2001 From: Petr Machata Date: Wed, 7 Aug 2024 16:13:47 +0200 Subject: net: nexthop: Increase weight to u16 In CLOS networks, as link failures occur at various points in the network, ECMP weights of the involved nodes are adjusted to compensate. With high fan-out of the involved nodes, and overall high number of nodes, a (non-)ECMP weight ratio that we would like to configure does not fit into 8 bits. Instead of, say, 255:254, we might like to configure something like 1000:999. For these deployments, the 8-bit weight may not be enough. To that end, in this patch increase the next hop weight from u8 to u16. Increasing the width of an integral type can be tricky, because while the code still compiles, the types may not check out anymore, and numerical errors come up. To prevent this, the conversion was done in two steps. First the type was changed from u8 to a single-member structure, which invalidated all uses of the field. 
This allowed going through them one by one and auditing them for type correctness. Then the structure was replaced with a vanilla u16 again. This should ensure that no place was missed.

The UAPI for configuring nexthop group members is that an attribute NHA_GROUP carries an array of struct nexthop_grp entries:

	struct nexthop_grp {
		__u32 id;      /* nexthop id - must exist */
		__u8 weight;   /* weight of this nexthop */
		__u8 resvd1;
		__u16 resvd2;
	};

The field resvd1 is currently validated and required to be zero. We can lift this requirement and carry high-order bits of the weight in the reserved field:

	struct nexthop_grp {
		__u32 id;      /* nexthop id - must exist */
		__u8 weight;   /* weight of this nexthop */
		__u8 weight_high;
		__u16 resvd2;
	};

Keeping the fields split this way was chosen in case an existing userspace makes assumptions about the width of the weight field, and to sidestep any endianness issues.

The weight field is currently encoded as the weight value minus one, because a weight of 0 is invalid. This same trick is impossible for the new weight_high field, because zero must mean actual zero.

With this in place:

- Old userspace is guaranteed to carry weight_high of 0, therefore configuring 8-bit weights as appropriate. When dumping nexthops with 16-bit weight, it would only show the lower 8 bits. But configuring such nexthops implies existence of userspace aware of the extension in the first place.

- New userspace talking to an old kernel will work as long as it only attempts to configure 8-bit weights, where the high-order bits are zero. An old kernel will bounce attempts at configuring >8-bit weights.

Renaming reserved fields as they are allocated for some purpose is commonly done in Linux. Whoever touches a reserved field is doing so at their own risk. nexthop_grp::resvd1 in particular is currently used by at least strace, however they carry their own copy of UAPI headers, and the conversion should be trivial.

A helper is provided for decoding the weight out of the two fields. Forcing a conversion seems preferable to bending over backwards and introducing anonymous unions or whatever.
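For illustration only (this helper is not part of the patch), a hypothetical userspace counterpart performing the inverse of the nexthop_grp_weight() decode, splitting a 16-bit effective weight across the two fields:

	/* Store an effective weight in the range [1, 0xffff]; the kernel
	 * decodes ((weight_high << 8) | weight) + 1, so w - 1 is what
	 * actually goes into the attribute.
	 */
	static void nexthop_grp_set_weight(struct nexthop_grp *entry,
					   unsigned int wanted)
	{
		unsigned int w = wanted - 1;

		entry->weight = w & 0xff;
		entry->weight_high = w >> 8;
	}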
Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel Reviewed-by: David Ahern Reviewed-by: Przemek Kitszel Link: https://patch.msgid.link/483e2fcf4beb0d9135d62e7d27b46fa2685479d4.1723036486.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski --- include/net/nexthop.h | 4 ++-- include/uapi/linux/nexthop.h | 7 ++++++- net/ipv4/nexthop.c | 37 +++++++++++++++++++++++-------------- 3 files changed, 31 insertions(+), 17 deletions(-) (limited to 'include/uapi') diff --git a/include/net/nexthop.h b/include/net/nexthop.h index 68463aebcc05..d9fb44e8b321 100644 --- a/include/net/nexthop.h +++ b/include/net/nexthop.h @@ -105,7 +105,7 @@ struct nh_grp_entry_stats { struct nh_grp_entry { struct nexthop *nh; struct nh_grp_entry_stats __percpu *stats; - u8 weight; + u16 weight; union { struct { @@ -192,7 +192,7 @@ struct nh_notifier_single_info { }; struct nh_notifier_grp_entry_info { - u8 weight; + u16 weight; struct nh_notifier_single_info nh; }; diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h index f4f060a87cc2..bc49baf4a267 100644 --- a/include/uapi/linux/nexthop.h +++ b/include/uapi/linux/nexthop.h @@ -16,10 +16,15 @@ struct nhmsg { struct nexthop_grp { __u32 id; /* nexthop id - must exist */ __u8 weight; /* weight of this nexthop */ - __u8 resvd1; + __u8 weight_high; /* high order bits of weight */ __u16 resvd2; }; +static inline __u16 nexthop_grp_weight(const struct nexthop_grp *entry) +{ + return ((entry->weight_high << 8) | entry->weight) + 1; +} + enum { NEXTHOP_GRP_TYPE_MPATH, /* hash-threshold nexthop group * default type if not specified diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index 23caa13bf24d..9db10d409074 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -872,6 +872,7 @@ static int nla_put_nh_group(struct sk_buff *skb, struct nexthop *nh, size_t len = nhg->num_nh * sizeof(*p); struct nlattr *nla; u16 group_type = 0; + u16 weight; int i; *resp_op_flags |= NHA_OP_FLAG_RESP_GRP_RESVD_0; @@ -890,9 +891,12 @@ static int nla_put_nh_group(struct sk_buff *skb, struct nexthop *nh, p = nla_data(nla); for (i = 0; i < nhg->num_nh; ++i) { + weight = nhg->nh_entries[i].weight - 1; + *p++ = (struct nexthop_grp) { .id = nhg->nh_entries[i].nh->id, - .weight = nhg->nh_entries[i].weight - 1, + .weight = weight, + .weight_high = weight >> 8, }; } @@ -1286,11 +1290,14 @@ static int nh_check_attr_group(struct net *net, nhg = nla_data(tb[NHA_GROUP]); for (i = 0; i < len; ++i) { - if (nhg[i].resvd1 || nhg[i].resvd2) { - NL_SET_ERR_MSG(extack, "Reserved fields in nexthop_grp must be 0"); + if (nhg[i].resvd2) { + NL_SET_ERR_MSG(extack, "Reserved field in nexthop_grp must be 0"); return -EINVAL; } - if (nhg[i].weight > 254) { + if (nexthop_grp_weight(&nhg[i]) == 0) { + /* 0xffff got passed in, representing weight of 0x10000, + * which is too heavy. 
+ */
 NL_SET_ERR_MSG(extack, "Invalid value for weight");
 return -EINVAL;
 }
@@ -1886,9 +1893,9 @@ static void nh_res_table_cancel_upkeep(struct nh_res_table *res_table)
 static void nh_res_group_rebalance(struct nh_group *nhg,
 struct nh_res_table *res_table)
 {
- int prev_upper_bound = 0;
- int total = 0;
- int w = 0;
+ u16 prev_upper_bound = 0;
+ u32 total = 0;
+ u32 w = 0;
 int i;

 INIT_LIST_HEAD(&res_table->uw_nh_entries);
@@ -1898,11 +1905,12 @@ static void nh_res_group_rebalance(struct nh_group *nhg,
 for (i = 0; i < nhg->num_nh; ++i) {
 struct nh_grp_entry *nhge = &nhg->nh_entries[i];
- int upper_bound;
+ u16 upper_bound;
+ u64 btw;

 w += nhge->weight;
- upper_bound = DIV_ROUND_CLOSEST(res_table->num_nh_buckets * w,
- total);
+ btw = ((u64)res_table->num_nh_buckets) * w;
+ upper_bound = DIV_ROUND_CLOSEST_ULL(btw, total);
 nhge->res.wants_buckets = upper_bound - prev_upper_bound;
 prev_upper_bound = upper_bound;
@@ -1968,8 +1976,8 @@ static void replace_nexthop_grp_res(struct nh_group *oldg,
 static void nh_hthr_group_rebalance(struct nh_group *nhg)
 {
- int total = 0;
- int w = 0;
+ u32 total = 0;
+ u32 w = 0;
 int i;

 for (i = 0; i < nhg->num_nh; ++i)
@@ -1977,7 +1985,7 @@ static void nh_hthr_group_rebalance(struct nh_group *nhg)
 for (i = 0; i < nhg->num_nh; ++i) {
 struct nh_grp_entry *nhge = &nhg->nh_entries[i];
- int upper_bound;
+ u32 upper_bound;

 w += nhge->weight;
 upper_bound = DIV_ROUND_CLOSEST_ULL((u64)w << 31, total) - 1;
@@ -2719,7 +2727,8 @@ static struct nexthop *nexthop_create_group(struct net *net,
 goto out_no_nh;
 }
 nhg->nh_entries[i].nh = nhe;
- nhg->nh_entries[i].weight = entry[i].weight + 1;
+ nhg->nh_entries[i].weight = nexthop_grp_weight(&entry[i]);
+
 list_add(&nhg->nh_entries[i].nh_list, &nhe->grp_list);
 nhg->nh_entries[i].nh_parent = nh;
 }
-- cgit v1.2.3

From c26cee817f8bd9a22bfade20f739ec2fc6f20221 Mon Sep 17 00:00:00 2001
From: David Sands
Date: Sat, 10 Aug 2024 20:00:05 -0400
Subject: usb: gadget: f_fs: add capability for dfu functional descriptor

Add the ability for the USB FunctionFS (FFS) gadget driver to create Device Firmware Upgrade (DFU) functional descriptors. [1] This patch allows implementation of DFU in userspace using the FFS gadget.

The DFU protocol uses the control pipe (ep0) for all messaging, so only the addition of the DFU functional descriptor is needed in the kernel driver. The DFU functional descriptor is written to the ep0 file along with any other descriptors during FFS setup. DFU requires an interface descriptor followed by the DFU functional descriptor.

This patch includes documentation of the added descriptor for DFU and conversion of some existing documentation to kernel-doc format so that it can be included in the generated docs.

An implementation of DFU 1.1 that implements just the runtime descriptor using the FunctionFS gadget (with rebooting into u-boot for DFU mode) has been tested on an i.MX8 Nano. An implementation of DFU 1.1 that implements both runtime and DFU mode using the FunctionFS gadget has been tested on Xilinx Zynq UltraScale+.

Note that for the best performance of firmware update file transfers, the userspace program should respond as quickly as possible to the setup packets.
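For illustration only (this sketch is not part of the patch), a userspace view of the interface descriptor plus DFU functional descriptor pair that could be written to ep0 among the other FFS descriptors; the timeout and transfer size values are arbitrary:

	#include <endian.h>
	#include <linux/usb/ch9.h>
	#include <linux/usb/functionfs.h>

	static const struct {
		struct usb_interface_descriptor intf;
		struct usb_dfu_functional_descriptor dfu;
	} __attribute__((packed)) dfu_descs = {
		.intf = {
			.bLength = sizeof(dfu_descs.intf),
			.bDescriptorType = USB_DT_INTERFACE,
			.bNumEndpoints = 0,
			.bInterfaceClass = USB_CLASS_APP_SPEC,
			.bInterfaceSubClass = USB_SUBCLASS_DFU,
			.bInterfaceProtocol = 1, /* DFU runtime protocol */
		},
		.dfu = {
			.bLength = sizeof(dfu_descs.dfu),
			.bDescriptorType = USB_DT_DFU_FUNCTIONAL,
			.bmAttributes = DFU_FUNC_ATT_CAN_DOWNLOAD |
					DFU_FUNC_ATT_WILL_DETACH,
			.wDetachTimeOut = htole16(1000), /* ms */
			.wTransferSize = htole16(4096),
			.bcdDFUVersion = htole16(0x0110), /* DFU 1.1 */
		},
	};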
[1] https://www.usb.org/sites/default/files/DFU_1.1.pdf

Signed-off-by: David Sands
Co-developed-by: Chris Wulff
Signed-off-by: Chris Wulff
Link: https://lore.kernel.org/r/20240811000004.1395888-2-crwulff@gmail.com
Signed-off-by: Greg Kroah-Hartman
---
Documentation/usb/functionfs-desc.rst | 39 ++++++++++++++
Documentation/usb/functionfs.rst | 2 +
Documentation/usb/index.rst | 1 +
drivers/usb/gadget/function/f_fs.c | 12 ++++-
include/uapi/linux/usb/ch9.h | 8 ++-
include/uapi/linux/usb/functionfs.h | 97 ++++++++++++++++++++++++++++++++---
6 files changed, 147 insertions(+), 12 deletions(-)
create mode 100644 Documentation/usb/functionfs-desc.rst
(limited to 'include/uapi')

diff --git a/Documentation/usb/functionfs-desc.rst b/Documentation/usb/functionfs-desc.rst
new file mode 100644
index 000000000000..39649774da54
--- /dev/null
+++ b/Documentation/usb/functionfs-desc.rst
@@ -0,0 +1,39 @@
+======================
+FunctionFS Descriptors
+======================
+
+Some of the descriptors that can be written to the FFS gadget are
+described below. Device and configuration descriptors are handled
+by the composite gadget and are not written by the user to the
+FFS gadget.
+
+Descriptors are written to the "ep0" file in the FFS gadget
+following the descriptor header.
+
+.. kernel-doc:: include/uapi/linux/usb/functionfs.h
+ :doc: descriptors
+
+Interface Descriptors
+---------------------
+
+Standard USB interface descriptors may be written. The class/subclass of the
+most recent interface descriptor determines what type of class-specific
+descriptors are accepted.
+
+Class-Specific Descriptors
+--------------------------
+
+Class-specific descriptors are accepted only for the class/subclass of the
+most recent interface descriptor. The following are some of the
+class-specific descriptors that are supported.
+
+DFU Functional Descriptor
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When the interface class is USB_CLASS_APP_SPEC and the interface subclass
+is USB_SUBCLASS_DFU, a DFU functional descriptor can be provided.
+The DFU functional descriptor is described in the USB specification for
+Device Firmware Upgrade (DFU), version 1.1 as of this writing.
+
+.. kernel-doc:: include/uapi/linux/usb/functionfs.h
+ :doc: usb_dfu_functional_descriptor

diff --git a/Documentation/usb/functionfs.rst b/Documentation/usb/functionfs.rst
index d05a775bc45b..f7487b0d8057 100644
--- a/Documentation/usb/functionfs.rst
+++ b/Documentation/usb/functionfs.rst
@@ -25,6 +25,8 @@ interface numbers starting from zero). The FunctionFS changes them as needed also handling situation when numbers differ in different configurations.

+For more information about FunctionFS descriptors, see :doc:`functionfs-desc`.
+
 When descriptors and strings are written "ep#" files appear (one for each declared endpoint) which handle communication on a single endpoint.
Again, FunctionFS takes care of the real diff --git a/Documentation/usb/index.rst b/Documentation/usb/index.rst index 27955dad95e1..826492c813ac 100644 --- a/Documentation/usb/index.rst +++ b/Documentation/usb/index.rst @@ -11,6 +11,7 @@ USB support dwc3 ehci functionfs + functionfs-desc gadget_configfs gadget_hid gadget_multi diff --git a/drivers/usb/gadget/function/f_fs.c b/drivers/usb/gadget/function/f_fs.c index e0ceaa721949..e35177fc6c55 100644 --- a/drivers/usb/gadget/function/f_fs.c +++ b/drivers/usb/gadget/function/f_fs.c @@ -2478,7 +2478,7 @@ typedef int (*ffs_os_desc_callback)(enum ffs_os_desc_type entity, static int __must_check ffs_do_single_desc(char *data, unsigned len, ffs_entity_callback entity, - void *priv, int *current_class) + void *priv, int *current_class, int *current_subclass) { struct usb_descriptor_header *_ds = (void *)data; u8 length; @@ -2535,6 +2535,7 @@ static int __must_check ffs_do_single_desc(char *data, unsigned len, if (ds->iInterface) __entity(STRING, ds->iInterface); *current_class = ds->bInterfaceClass; + *current_subclass = ds->bInterfaceSubClass; } break; @@ -2559,6 +2560,12 @@ static int __must_check ffs_do_single_desc(char *data, unsigned len, if (length != sizeof(struct ccid_descriptor)) goto inv_length; break; + } else if (*current_class == USB_CLASS_APP_SPEC && + *current_subclass == USB_SUBCLASS_DFU) { + pr_vdebug("dfu functional descriptor\n"); + if (length != sizeof(struct usb_dfu_functional_descriptor)) + goto inv_length; + break; } else { pr_vdebug("unknown descriptor: %d for class %d\n", _ds->bDescriptorType, *current_class); @@ -2621,6 +2628,7 @@ static int __must_check ffs_do_descs(unsigned count, char *data, unsigned len, const unsigned _len = len; unsigned long num = 0; int current_class = -1; + int current_subclass = -1; for (;;) { int ret; @@ -2640,7 +2648,7 @@ static int __must_check ffs_do_descs(unsigned count, char *data, unsigned len, return _len - len; ret = ffs_do_single_desc(data, len, entity, priv, - ¤t_class); + ¤t_class, ¤t_subclass); if (ret < 0) { pr_debug("%s returns %d\n", __func__, ret); return ret; diff --git a/include/uapi/linux/usb/ch9.h b/include/uapi/linux/usb/ch9.h index 44d73ba8788d..91f0f7e214a5 100644 --- a/include/uapi/linux/usb/ch9.h +++ b/include/uapi/linux/usb/ch9.h @@ -254,6 +254,9 @@ struct usb_ctrlrequest { #define USB_DT_DEVICE_CAPABILITY 0x10 #define USB_DT_WIRELESS_ENDPOINT_COMP 0x11 #define USB_DT_WIRE_ADAPTER 0x21 +/* From USB Device Firmware Upgrade Specification, Revision 1.1 */ +#define USB_DT_DFU_FUNCTIONAL 0x21 +/* these are from the Wireless USB spec */ #define USB_DT_RPIPE 0x22 #define USB_DT_CS_RADIO_CONTROL 0x23 /* From the T10 UAS specification */ @@ -329,9 +332,10 @@ struct usb_device_descriptor { #define USB_CLASS_USB_TYPE_C_BRIDGE 0x12 #define USB_CLASS_MISC 0xef #define USB_CLASS_APP_SPEC 0xfe -#define USB_CLASS_VENDOR_SPEC 0xff +#define USB_SUBCLASS_DFU 0x01 -#define USB_SUBCLASS_VENDOR_SPEC 0xff +#define USB_CLASS_VENDOR_SPEC 0xff +#define USB_SUBCLASS_VENDOR_SPEC 0xff /*-------------------------------------------------------------------------*/ diff --git a/include/uapi/linux/usb/functionfs.h b/include/uapi/linux/usb/functionfs.h index 9f88de9c3d66..2ebdba111a8f 100644 --- a/include/uapi/linux/usb/functionfs.h +++ b/include/uapi/linux/usb/functionfs.h @@ -3,6 +3,7 @@ #define _UAPI__LINUX_FUNCTIONFS_H__ +#include #include #include @@ -37,6 +38,31 @@ struct usb_endpoint_descriptor_no_audio { __u8 bInterval; } __attribute__((packed)); +/** + * struct usb_dfu_functional_descriptor - 
DFU Functional descriptor + * @bLength: Size of the descriptor (bytes) + * @bDescriptorType: USB_DT_DFU_FUNCTIONAL + * @bmAttributes: DFU attributes + * @wDetachTimeOut: Maximum time to wait after DFU_DETACH (ms, le16) + * @wTransferSize: Maximum number of bytes per control-write (le16) + * @bcdDFUVersion: DFU Spec version (BCD, le16) + */ +struct usb_dfu_functional_descriptor { + __u8 bLength; + __u8 bDescriptorType; + __u8 bmAttributes; + __le16 wDetachTimeOut; + __le16 wTransferSize; + __le16 bcdDFUVersion; +} __attribute__ ((packed)); + +/* from DFU functional descriptor bmAttributes */ +#define DFU_FUNC_ATT_CAN_DOWNLOAD _BITUL(0) +#define DFU_FUNC_ATT_CAN_UPLOAD _BITUL(1) +#define DFU_FUNC_ATT_MANIFEST_TOLERANT _BITUL(2) +#define DFU_FUNC_ATT_WILL_DETACH _BITUL(3) + + struct usb_functionfs_descs_head_v2 { __le32 magic; __le32 length; @@ -104,23 +130,38 @@ struct usb_ffs_dmabuf_transfer_req { #ifndef __KERNEL__ -/* +/** + * DOC: descriptors + * * Descriptors format: * + * +-----+-----------+--------------+--------------------------------------+ * | off | name | type | description | - * |-----+-----------+--------------+--------------------------------------| + * +-----+-----------+--------------+--------------------------------------+ * | 0 | magic | LE32 | FUNCTIONFS_DESCRIPTORS_MAGIC_V2 | + * +-----+-----------+--------------+--------------------------------------+ * | 4 | length | LE32 | length of the whole data chunk | + * +-----+-----------+--------------+--------------------------------------+ * | 8 | flags | LE32 | combination of functionfs_flags | + * +-----+-----------+--------------+--------------------------------------+ * | | eventfd | LE32 | eventfd file descriptor | + * +-----+-----------+--------------+--------------------------------------+ * | | fs_count | LE32 | number of full-speed descriptors | + * +-----+-----------+--------------+--------------------------------------+ * | | hs_count | LE32 | number of high-speed descriptors | + * +-----+-----------+--------------+--------------------------------------+ * | | ss_count | LE32 | number of super-speed descriptors | + * +-----+-----------+--------------+--------------------------------------+ * | | os_count | LE32 | number of MS OS descriptors | + * +-----+-----------+--------------+--------------------------------------+ * | | fs_descrs | Descriptor[] | list of full-speed descriptors | + * +-----+-----------+--------------+--------------------------------------+ * | | hs_descrs | Descriptor[] | list of high-speed descriptors | + * +-----+-----------+--------------+--------------------------------------+ * | | ss_descrs | Descriptor[] | list of super-speed descriptors | + * +-----+-----------+--------------+--------------------------------------+ * | | os_descrs | OSDesc[] | list of MS OS descriptors | + * +-----+-----------+--------------+--------------------------------------+ * * Depending on which flags are set, various fields may be missing in the * structure. 
Any flags that are not recognised cause the whole block to be @@ -128,71 +169,111 @@ struct usb_ffs_dmabuf_transfer_req { * * Legacy descriptors format (deprecated as of 3.14): * + * +-----+-----------+--------------+--------------------------------------+ * | off | name | type | description | - * |-----+-----------+--------------+--------------------------------------| + * +-----+-----------+--------------+--------------------------------------+ * | 0 | magic | LE32 | FUNCTIONFS_DESCRIPTORS_MAGIC | + * +-----+-----------+--------------+--------------------------------------+ * | 4 | length | LE32 | length of the whole data chunk | + * +-----+-----------+--------------+--------------------------------------+ * | 8 | fs_count | LE32 | number of full-speed descriptors | + * +-----+-----------+--------------+--------------------------------------+ * | 12 | hs_count | LE32 | number of high-speed descriptors | + * +-----+-----------+--------------+--------------------------------------+ * | 16 | fs_descrs | Descriptor[] | list of full-speed descriptors | + * +-----+-----------+--------------+--------------------------------------+ * | | hs_descrs | Descriptor[] | list of high-speed descriptors | + * +-----+-----------+--------------+--------------------------------------+ * * All numbers must be in little endian order. * * Descriptor[] is an array of valid USB descriptors which have the following * format: * + * +-----+-----------------+------+--------------------------+ * | off | name | type | description | - * |-----+-----------------+------+--------------------------| + * +-----+-----------------+------+--------------------------+ * | 0 | bLength | U8 | length of the descriptor | + * +-----+-----------------+------+--------------------------+ * | 1 | bDescriptorType | U8 | descriptor type | + * +-----+-----------------+------+--------------------------+ * | 2 | payload | | descriptor's payload | + * +-----+-----------------+------+--------------------------+ * * OSDesc[] is an array of valid MS OS Feature Descriptors which have one of * the following formats: * + * +-----+-----------------+------+--------------------------+ * | off | name | type | description | - * |-----+-----------------+------+--------------------------| + * +-----+-----------------+------+--------------------------+ * | 0 | inteface | U8 | related interface number | + * +-----+-----------------+------+--------------------------+ * | 1 | dwLength | U32 | length of the descriptor | + * +-----+-----------------+------+--------------------------+ * | 5 | bcdVersion | U16 | currently supported: 1 | + * +-----+-----------------+------+--------------------------+ * | 7 | wIndex | U16 | currently supported: 4 | + * +-----+-----------------+------+--------------------------+ * | 9 | bCount | U8 | number of ext. compat. | + * +-----+-----------------+------+--------------------------+ * | 10 | Reserved | U8 | 0 | + * +-----+-----------------+------+--------------------------+ * | 11 | ExtCompat[] | | list of ext. compat. d. 
| + * +-----+-----------------+------+--------------------------+ * + * +-----+-----------------+------+--------------------------+ * | off | name | type | description | - * |-----+-----------------+------+--------------------------| + * +-----+-----------------+------+--------------------------+ * | 0 | inteface | U8 | related interface number | + * +-----+-----------------+------+--------------------------+ * | 1 | dwLength | U32 | length of the descriptor | + * +-----+-----------------+------+--------------------------+ * | 5 | bcdVersion | U16 | currently supported: 1 | + * +-----+-----------------+------+--------------------------+ * | 7 | wIndex | U16 | currently supported: 5 | + * +-----+-----------------+------+--------------------------+ * | 9 | wCount | U16 | number of ext. compat. | + * +-----+-----------------+------+--------------------------+ * | 11 | ExtProp[] | | list of ext. prop. d. | + * +-----+-----------------+------+--------------------------+ * * ExtCompat[] is an array of valid Extended Compatiblity descriptors * which have the following format: * + * +-----+-----------------------+------+-------------------------------------+ * | off | name | type | description | - * |-----+-----------------------+------+-------------------------------------| + * +-----+-----------------------+------+-------------------------------------+ * | 0 | bFirstInterfaceNumber | U8 | index of the interface or of the 1st| + * +-----+-----------------------+------+-------------------------------------+ * | | | | interface in an IAD group | + * +-----+-----------------------+------+-------------------------------------+ * | 1 | Reserved | U8 | 1 | + * +-----+-----------------------+------+-------------------------------------+ * | 2 | CompatibleID | U8[8]| compatible ID string | + * +-----+-----------------------+------+-------------------------------------+ * | 10 | SubCompatibleID | U8[8]| subcompatible ID string | + * +-----+-----------------------+------+-------------------------------------+ * | 18 | Reserved | U8[6]| 0 | + * +-----+-----------------------+------+-------------------------------------+ * * ExtProp[] is an array of valid Extended Properties descriptors * which have the following format: * + * +-----+-----------------------+------+-------------------------------------+ * | off | name | type | description | - * |-----+-----------------------+------+-------------------------------------| + * +-----+-----------------------+------+-------------------------------------+ * | 0 | dwSize | U32 | length of the descriptor | + * +-----+-----------------------+------+-------------------------------------+ * | 4 | dwPropertyDataType | U32 | 1..7 | + * +-----+-----------------------+------+-------------------------------------+ * | 8 | wPropertyNameLength | U16 | bPropertyName length (NL) | + * +-----+-----------------------+------+-------------------------------------+ * | 10 | bPropertyName |U8[NL]| name of this property | + * +-----+-----------------------+------+-------------------------------------+ * |10+NL| dwPropertyDataLength | U32 | bPropertyData length (DL) | + * +-----+-----------------------+------+-------------------------------------+ * |14+NL| bProperty |U8[DL]| payload of this property | + * +-----+-----------------------+------+-------------------------------------+ */ struct usb_functionfs_strings_head { -- cgit v1.2.3 From ac79beb913dc40b1a49b628e31afdfeb20194ab6 Mon Sep 17 00:00:00 2001 From: Paul Elder Date: Thu, 8 Aug 2024 22:41:05 +0200 Subject: media: rkisp1: Add 
support for the companding block Add support to the rkisp1 driver for the companding block that exists on the i.MX8MP version of the ISP. This requires usage of the new extensible parameters format, and showcases how the format allows for extensions without breaking backward compatibility. Signed-off-by: Paul Elder Reviewed-by: Jacopo Mondi Reviewed-by: Paul Elder Signed-off-by: Jacopo Mondi Tested-by: Kieran Bingham Acked-by: Sakari Ailus Signed-off-by: Laurent Pinchart --- .../media/platform/rockchip/rkisp1/rkisp1-params.c | 168 +++++++++++++++++++++ include/uapi/linux/rkisp1-config.h | 89 ++++++++++- 2 files changed, 256 insertions(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/drivers/media/platform/rockchip/rkisp1/rkisp1-params.c b/drivers/media/platform/rockchip/rkisp1/rkisp1-params.c index 9cbbc6adef39..320581a9f866 100644 --- a/drivers/media/platform/rockchip/rkisp1/rkisp1-params.c +++ b/drivers/media/platform/rockchip/rkisp1/rkisp1-params.c @@ -5,6 +5,7 @@ * Copyright (C) 2017 Rockchip Electronics Co., Ltd. */ +#include #include #include @@ -57,6 +58,8 @@ union rkisp1_ext_params_config { struct rkisp1_ext_params_hst_config hst; struct rkisp1_ext_params_aec_config aec; struct rkisp1_ext_params_afc_config afc; + struct rkisp1_ext_params_compand_bls_config compand_bls; + struct rkisp1_ext_params_compand_curve_config compand_curve; }; enum rkisp1_params_formats { @@ -1258,6 +1261,93 @@ rkisp1_dpf_strength_config(struct rkisp1_params *params, rkisp1_write(params->rkisp1, RKISP1_CIF_ISP_DPF_STRENGTH_R, arg->r); } +static void rkisp1_compand_write_px_curve(struct rkisp1_params *params, + unsigned int addr, const u8 *curve) +{ + const unsigned int points_per_reg = 6; + const unsigned int num_regs = + DIV_ROUND_UP(RKISP1_CIF_ISP_COMPAND_NUM_POINTS, + points_per_reg); + + /* + * The compand curve is specified as a piecewise linear function with + * 64 points. X coordinates are stored as a log2 of the displacement + * from the previous point, in 5 bits, with 6 values per register. The + * last register stores 4 values. 
+ */ + for (unsigned int reg = 0; reg < num_regs; ++reg) { + unsigned int num_points = + min(RKISP1_CIF_ISP_COMPAND_NUM_POINTS - + reg * points_per_reg, points_per_reg); + u32 val = 0; + + for (unsigned int i = 0; i < num_points; i++) + val |= (*curve++ & 0x1f) << (i * 5); + + rkisp1_write(params->rkisp1, addr, val); + addr += 4; + } +} + +static void +rkisp1_compand_write_curve_mem(struct rkisp1_params *params, + unsigned int reg_addr, unsigned int reg_data, + const u32 curve[RKISP1_CIF_ISP_COMPAND_NUM_POINTS]) +{ + for (unsigned int i = 0; i < RKISP1_CIF_ISP_COMPAND_NUM_POINTS; i++) { + rkisp1_write(params->rkisp1, reg_addr, i); + rkisp1_write(params->rkisp1, reg_data, curve[i]); + } +} + +static void +rkisp1_compand_bls_config(struct rkisp1_params *params, + const struct rkisp1_cif_isp_compand_bls_config *arg) +{ + static const u32 regs[] = { + RKISP1_CIF_ISP_COMPAND_BLS_A_FIXED, + RKISP1_CIF_ISP_COMPAND_BLS_B_FIXED, + RKISP1_CIF_ISP_COMPAND_BLS_C_FIXED, + RKISP1_CIF_ISP_COMPAND_BLS_D_FIXED, + }; + u32 swapped[4]; + + rkisp1_bls_swap_regs(params->raw_type, regs, swapped); + + rkisp1_write(params->rkisp1, swapped[0], arg->r); + rkisp1_write(params->rkisp1, swapped[1], arg->gr); + rkisp1_write(params->rkisp1, swapped[2], arg->gb); + rkisp1_write(params->rkisp1, swapped[3], arg->b); +} + +static void +rkisp1_compand_expand_config(struct rkisp1_params *params, + const struct rkisp1_cif_isp_compand_curve_config *arg) +{ + rkisp1_compand_write_px_curve(params, RKISP1_CIF_ISP_COMPAND_EXPAND_PX_N(0), + arg->px); + rkisp1_compand_write_curve_mem(params, RKISP1_CIF_ISP_COMPAND_EXPAND_Y_ADDR, + RKISP1_CIF_ISP_COMPAND_EXPAND_Y_WRITE_DATA, + arg->y); + rkisp1_compand_write_curve_mem(params, RKISP1_CIF_ISP_COMPAND_EXPAND_X_ADDR, + RKISP1_CIF_ISP_COMPAND_EXPAND_X_WRITE_DATA, + arg->x); +} + +static void +rkisp1_compand_compress_config(struct rkisp1_params *params, + const struct rkisp1_cif_isp_compand_curve_config *arg) +{ + rkisp1_compand_write_px_curve(params, RKISP1_CIF_ISP_COMPAND_COMPRESS_PX_N(0), + arg->px); + rkisp1_compand_write_curve_mem(params, RKISP1_CIF_ISP_COMPAND_COMPRESS_Y_ADDR, + RKISP1_CIF_ISP_COMPAND_COMPRESS_Y_WRITE_DATA, + arg->y); + rkisp1_compand_write_curve_mem(params, RKISP1_CIF_ISP_COMPAND_COMPRESS_X_ADDR, + RKISP1_CIF_ISP_COMPAND_COMPRESS_X_WRITE_DATA, + arg->x); +} + static void rkisp1_isp_isr_other_config(struct rkisp1_params *params, const struct rkisp1_params_cfg *new_params) @@ -1855,6 +1945,66 @@ rkisp1_ext_params_afcm(struct rkisp1_params *params, RKISP1_CIF_ISP_AFM_ENA); } +static void rkisp1_ext_params_compand_bls(struct rkisp1_params *params, + const union rkisp1_ext_params_config *block) +{ + const struct rkisp1_ext_params_compand_bls_config *bls = + &block->compand_bls; + + if (bls->header.flags & RKISP1_EXT_PARAMS_FL_BLOCK_DISABLE) { + rkisp1_param_clear_bits(params, RKISP1_CIF_ISP_COMPAND_CTRL, + RKISP1_CIF_ISP_COMPAND_CTRL_BLS_ENABLE); + return; + } + + rkisp1_compand_bls_config(params, &bls->config); + + if ((bls->header.flags & RKISP1_EXT_PARAMS_FL_BLOCK_ENABLE) && + !(params->enabled_blocks & BIT(bls->header.type))) + rkisp1_param_set_bits(params, RKISP1_CIF_ISP_COMPAND_CTRL, + RKISP1_CIF_ISP_COMPAND_CTRL_BLS_ENABLE); +} + +static void rkisp1_ext_params_compand_expand(struct rkisp1_params *params, + const union rkisp1_ext_params_config *block) +{ + const struct rkisp1_ext_params_compand_curve_config *curve = + &block->compand_curve; + + if (curve->header.flags & RKISP1_EXT_PARAMS_FL_BLOCK_DISABLE) { + rkisp1_param_clear_bits(params, 
RKISP1_CIF_ISP_COMPAND_CTRL, + RKISP1_CIF_ISP_COMPAND_CTRL_EXPAND_ENABLE); + return; + } + + rkisp1_compand_expand_config(params, &curve->config); + + if ((curve->header.flags & RKISP1_EXT_PARAMS_FL_BLOCK_ENABLE) && + !(params->enabled_blocks & BIT(curve->header.type))) + rkisp1_param_set_bits(params, RKISP1_CIF_ISP_COMPAND_CTRL, + RKISP1_CIF_ISP_COMPAND_CTRL_EXPAND_ENABLE); +} + +static void rkisp1_ext_params_compand_compress(struct rkisp1_params *params, + const union rkisp1_ext_params_config *block) +{ + const struct rkisp1_ext_params_compand_curve_config *curve = + &block->compand_curve; + + if (curve->header.flags & RKISP1_EXT_PARAMS_FL_BLOCK_DISABLE) { + rkisp1_param_clear_bits(params, RKISP1_CIF_ISP_COMPAND_CTRL, + RKISP1_CIF_ISP_COMPAND_CTRL_COMPRESS_ENABLE); + return; + } + + rkisp1_compand_compress_config(params, &curve->config); + + if ((curve->header.flags & RKISP1_EXT_PARAMS_FL_BLOCK_ENABLE) && + !(params->enabled_blocks & BIT(curve->header.type))) + rkisp1_param_set_bits(params, RKISP1_CIF_ISP_COMPAND_CTRL, + RKISP1_CIF_ISP_COMPAND_CTRL_COMPRESS_ENABLE); +} + typedef void (*rkisp1_block_handler)(struct rkisp1_params *params, const union rkisp1_ext_params_config *config); @@ -1950,6 +2100,24 @@ static const struct rkisp1_ext_params_handler { .handler = rkisp1_ext_params_afcm, .group = RKISP1_EXT_PARAMS_BLOCK_GROUP_OTHERS, }, + [RKISP1_EXT_PARAMS_BLOCK_TYPE_COMPAND_BLS] = { + .size = sizeof(struct rkisp1_ext_params_compand_bls_config), + .handler = rkisp1_ext_params_compand_bls, + .group = RKISP1_EXT_PARAMS_BLOCK_GROUP_OTHERS, + .features = RKISP1_FEATURE_COMPAND, + }, + [RKISP1_EXT_PARAMS_BLOCK_TYPE_COMPAND_EXPAND] = { + .size = sizeof(struct rkisp1_ext_params_compand_curve_config), + .handler = rkisp1_ext_params_compand_expand, + .group = RKISP1_EXT_PARAMS_BLOCK_GROUP_OTHERS, + .features = RKISP1_FEATURE_COMPAND, + }, + [RKISP1_EXT_PARAMS_BLOCK_TYPE_COMPAND_COMPRESS] = { + .size = sizeof(struct rkisp1_ext_params_compand_curve_config), + .handler = rkisp1_ext_params_compand_compress, + .group = RKISP1_EXT_PARAMS_BLOCK_GROUP_OTHERS, + .features = RKISP1_FEATURE_COMPAND, + }, }; static void rkisp1_ext_params_config(struct rkisp1_params *params, diff --git a/include/uapi/linux/rkisp1-config.h b/include/uapi/linux/rkisp1-config.h index 9767bb19c72d..430daceafac7 100644 --- a/include/uapi/linux/rkisp1-config.h +++ b/include/uapi/linux/rkisp1-config.h @@ -164,6 +164,11 @@ #define RKISP1_CIF_ISP_DPF_MAX_NLF_COEFFS 17 #define RKISP1_CIF_ISP_DPF_MAX_SPATIAL_COEFFS 6 +/* + * Compand + */ +#define RKISP1_CIF_ISP_COMPAND_NUM_POINTS 64 + /* * Measurement types */ @@ -851,6 +856,39 @@ struct rkisp1_params_cfg { struct rkisp1_cif_isp_isp_other_cfg others; }; +/** + * struct rkisp1_cif_isp_compand_bls_config - Rockchip ISP1 Companding parameters (BLS) + * @r: Fixed subtraction value for Bayer pattern R + * @gr: Fixed subtraction value for Bayer pattern Gr + * @gb: Fixed subtraction value for Bayer pattern Gb + * @b: Fixed subtraction value for Bayer pattern B + * + * The values will be subtracted from the sensor values. Note that unlike the + * dedicated BLS block, the BLS values in the compander are 20-bit unsigned. + */ +struct rkisp1_cif_isp_compand_bls_config { + __u32 r; + __u32 gr; + __u32 gb; + __u32 b; +}; + +/** + * struct rkisp1_cif_isp_compand_curve_config - Rockchip ISP1 Companding + * parameters (expand and compression curves) + * @px: Compand curve x-values. Each value stores the distance from the + * previous x-value, expressed as log2 of the distance on 5 bits. 
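+ *	For example, a px value of 4 places a point 2^4 = 16 units after
+ *	the previous point.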
+ * @x: Compand curve x-values. The functionality of these parameters is + * unknown due to a lack of hardware documentation, but they are left + * here for future compatibility purposes. + * @y: Compand curve y-values + */ +struct rkisp1_cif_isp_compand_curve_config { + __u8 px[RKISP1_CIF_ISP_COMPAND_NUM_POINTS]; + __u32 x[RKISP1_CIF_ISP_COMPAND_NUM_POINTS]; + __u32 y[RKISP1_CIF_ISP_COMPAND_NUM_POINTS]; +}; + /*---------- PART2: Measurement Statistics ------------*/ /** @@ -1018,6 +1056,9 @@ struct rkisp1_stat_buffer { * @RKISP1_EXT_PARAMS_BLOCK_TYPE_HST_MEAS: Histogram statistics * @RKISP1_EXT_PARAMS_BLOCK_TYPE_AEC_MEAS: Auto exposure statistics * @RKISP1_EXT_PARAMS_BLOCK_TYPE_AFC_MEAS: Auto-focus statistics + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_COMPAND_BLS: BLS in the compand block + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_COMPAND_EXPAND: Companding expand curve + * @RKISP1_EXT_PARAMS_BLOCK_TYPE_COMPAND_COMPRESS: Companding compress curve */ enum rkisp1_ext_params_block_type { RKISP1_EXT_PARAMS_BLOCK_TYPE_BLS, @@ -1037,6 +1078,9 @@ enum rkisp1_ext_params_block_type { RKISP1_EXT_PARAMS_BLOCK_TYPE_HST_MEAS, RKISP1_EXT_PARAMS_BLOCK_TYPE_AEC_MEAS, RKISP1_EXT_PARAMS_BLOCK_TYPE_AFC_MEAS, + RKISP1_EXT_PARAMS_BLOCK_TYPE_COMPAND_BLS, + RKISP1_EXT_PARAMS_BLOCK_TYPE_COMPAND_EXPAND, + RKISP1_EXT_PARAMS_BLOCK_TYPE_COMPAND_COMPRESS, }; #define RKISP1_EXT_PARAMS_FL_BLOCK_DISABLE (1U << 0) @@ -1380,6 +1424,46 @@ struct rkisp1_ext_params_afc_config { struct rkisp1_cif_isp_afc_config config; } __attribute__((aligned(8))); +/** + * struct rkisp1_ext_params_compand_bls_config - RkISP1 extensible params + * Compand BLS config + * + * RkISP1 extensible parameters Companding configuration block (black level + * subtraction). Identified by :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_COMPAND_BLS`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Companding BLS configuration, see + * :c:type:`rkisp1_cif_isp_compand_bls_config` + */ +struct rkisp1_ext_params_compand_bls_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_compand_bls_config config; +} __attribute__((aligned(8))); + +/** + * struct rkisp1_ext_params_compand_curve_config - RkISP1 extensible params + * Compand curve config + * + * RkISP1 extensible parameters Companding configuration block (expand and + * compression curves). Identified by + * :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_COMPAND_EXPAND` or + * :c:type:`RKISP1_EXT_PARAMS_BLOCK_TYPE_COMPAND_COMPRESS`. + * + * @header: The RkISP1 extensible parameters header, see + * :c:type:`rkisp1_ext_params_block_header` + * @config: Companding curve configuration, see + * :c:type:`rkisp1_cif_isp_compand_curve_config` + */ +struct rkisp1_ext_params_compand_curve_config { + struct rkisp1_ext_params_block_header header; + struct rkisp1_cif_isp_compand_curve_config config; +} __attribute__((aligned(8))); + +/* + * The rkisp1_ext_params_compand_curve_config structure is counted twice as it + * is used for both the COMPAND_EXPAND and COMPAND_COMPRESS block types.
+ */ #define RKISP1_EXT_PARAMS_MAX_SIZE \ (sizeof(struct rkisp1_ext_params_bls_config) +\ sizeof(struct rkisp1_ext_params_dpcc_config) +\ @@ -1397,7 +1481,10 @@ struct rkisp1_ext_params_afc_config { sizeof(struct rkisp1_ext_params_awb_meas_config) +\ sizeof(struct rkisp1_ext_params_hst_config) +\ sizeof(struct rkisp1_ext_params_aec_config) +\ - sizeof(struct rkisp1_ext_params_afc_config)) + sizeof(struct rkisp1_ext_params_afc_config) +\ + sizeof(struct rkisp1_ext_params_compand_bls_config) +\ + sizeof(struct rkisp1_ext_params_compand_curve_config) +\ + sizeof(struct rkisp1_ext_params_compand_curve_config)) /** * enum rksip1_ext_param_buffer_version - RkISP1 extensible parameters version -- cgit v1.2.3 From 216203bdc2280d8fc5baf60707eee2051de1426e Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Tue, 13 Aug 2024 16:15:02 -0600 Subject: UAPI: net/sched: Use __struct_group() in flex struct tc_u32_sel Use the `__struct_group()` helper to create a new tagged `struct tc_u32_sel_hdr`. This structure groups together all the members of the flexible `struct tc_u32_sel` except the flexible array. As a result, the array is effectively separated from the rest of the members without modifying the memory layout of the flexible structure. This new tagged struct will be used to fix problematic declarations of middle-flex-arrays in composite structs[1]. [1] https://git.kernel.org/linus/d88cabfd9abc Signed-off-by: Gustavo A. R. Silva Link: https://patch.msgid.link/e59fe833564ddc5b2cc83056a4c504be887d6193.1723586870.git.gustavoars@kernel.org Signed-off-by: Jakub Kicinski --- include/uapi/linux/pkt_cls.h | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h index d36d9cdf0c00..2c32080416b5 100644 --- a/include/uapi/linux/pkt_cls.h +++ b/include/uapi/linux/pkt_cls.h @@ -246,16 +246,19 @@ struct tc_u32_key { }; struct tc_u32_sel { - unsigned char flags; - unsigned char offshift; - unsigned char nkeys; - - __be16 offmask; - __u16 off; - short offoff; - - short hoff; - __be32 hmask; + /* New members MUST be added within the __struct_group() macro below. */ + __struct_group(tc_u32_sel_hdr, hdr, /* no attrs */, + unsigned char flags; + unsigned char offshift; + unsigned char nkeys; + + __be16 offmask; + __u16 off; + short offoff; + + short hoff; + __be32 hmask; + ); struct tc_u32_key keys[]; }; -- cgit v1.2.3 From 2140e63cd87fa707acf514d594725097bed018fd Mon Sep 17 00:00:00 2001 From: Oleksij Rempel Date: Mon, 12 Aug 2024 09:30:44 +0200 Subject: ethtool: Add new result codes for TDR diagnostics Add new result codes to support TDR diagnostics in preparation for Open Alliance 1000BaseT1 TDR support: - ETHTOOL_A_CABLE_RESULT_CODE_NOISE: TDR not possible due to high noise level. - ETHTOOL_A_CABLE_RESULT_CODE_RESOLUTION_NOT_POSSIBLE: TDR resolution not possible / out of distance. 
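As a rough illustration, a PHY driver adopting these codes might map its TDR measurement status onto them through the existing ethnl_cable_test_result() helper. The driver function, EXAMPLE_NOISE_LIMIT and the in_range flag below are hypothetical; only the result codes and the helper are real:

	static void example_report_tdr_result(struct phy_device *phydev, u8 pair,
					      u16 noise, bool in_range)
	{
		/* Hypothetical mapping; a real driver uses vendor-specific limits. */
		if (noise > EXAMPLE_NOISE_LIMIT)
			ethnl_cable_test_result(phydev, pair,
						ETHTOOL_A_CABLE_RESULT_CODE_NOISE);
		else if (!in_range)
			ethnl_cable_test_result(phydev, pair,
						ETHTOOL_A_CABLE_RESULT_CODE_RESOLUTION_NOT_POSSIBLE);
		else
			ethnl_cable_test_result(phydev, pair,
						ETHTOOL_A_CABLE_RESULT_CODE_OK);
	}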
Reviewed-by: Andrew Lunn Signed-off-by: Oleksij Rempel Link: https://patch.msgid.link/20240812073046.1728288-1-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski --- include/uapi/linux/ethtool_netlink.h | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/ethtool_netlink.h b/include/uapi/linux/ethtool_netlink.h index 93c57525a975..9074fa309bd6 100644 --- a/include/uapi/linux/ethtool_netlink.h +++ b/include/uapi/linux/ethtool_netlink.h @@ -556,6 +556,10 @@ enum { * a regular 100 Ohm cable and a part with the abnormal impedance value */ ETHTOOL_A_CABLE_RESULT_CODE_IMPEDANCE_MISMATCH, + /* TDR not possible due to high noise level */ + ETHTOOL_A_CABLE_RESULT_CODE_NOISE, + /* TDR resolution not possible / out of distance */ + ETHTOOL_A_CABLE_RESULT_CODE_RESOLUTION_NOT_POSSIBLE, }; enum { -- cgit v1.2.3 From 37745918e0e7575bc40f38da93a99b9fa6406224 Mon Sep 17 00:00:00 2001 From: Ivan Orlov Date: Tue, 13 Aug 2024 13:07:00 +0100 Subject: ALSA: timer: Introduce virtual userspace-driven timers Implement two ioctl calls in order to support virtual userspace-driven ALSA timers. The first ioctl is SNDRV_TIMER_IOCTL_CREATE, which gets the snd_timer_uinfo struct as a parameter and puts a file descriptor of a virtual timer into the `fd` field of the snd_timer_uinfo structure. It also updates the `id` field of the snd_timer_uinfo struct, which provides a unique identifier for the timer (basically, the subdevice number which can be used when creating timer instances). This patch also introduces a tiny id allocator for the userspace-driven timers, which guarantees that we don't have more than 128 of them in the system. Another ioctl is SNDRV_TIMER_IOCTL_TRIGGER, which allows us to trigger the virtual timer (and calls snd_timer_interrupt for the timer under the hood), causing all of the timer instances bound to this timer to execute their callbacks. The maximum number of ticks available for the timer is 1 for the sake of simplicity of the userspace API. 'start', 'stop', 'open' and 'close' callbacks for the userspace-driven timers are empty since we don't really do any hardware initialization here. Suggested-by: Axel Holzinger Signed-off-by: Ivan Orlov Signed-off-by: Takashi Iwai Link: https://patch.msgid.link/20240813120701.171743-4-ivan.orlov0322@gmail.com --- include/uapi/sound/asound.h | 17 +++- sound/core/Kconfig | 10 ++ sound/core/timer.c | 225 ++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 251 insertions(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/include/uapi/sound/asound.h b/include/uapi/sound/asound.h index 8bf7e8a0eb6f..4cd513215bcd 100644 --- a/include/uapi/sound/asound.h +++ b/include/uapi/sound/asound.h @@ -869,7 +869,7 @@ struct snd_ump_block_info { * Timer section - /dev/snd/timer */ -#define SNDRV_TIMER_VERSION SNDRV_PROTOCOL_VERSION(2, 0, 7) +#define SNDRV_TIMER_VERSION SNDRV_PROTOCOL_VERSION(2, 0, 8) enum { SNDRV_TIMER_CLASS_NONE = -1, @@ -894,6 +894,7 @@ enum { #define SNDRV_TIMER_GLOBAL_RTC 1 /* unused */ #define SNDRV_TIMER_GLOBAL_HPET 2 #define SNDRV_TIMER_GLOBAL_HRTIMER 3 +#define SNDRV_TIMER_GLOBAL_UDRIVEN 4 /* info flags */ #define SNDRV_TIMER_FLG_SLAVE (1<<0) /* cannot be controlled */ @@ -974,6 +975,18 @@ struct snd_timer_status { }; #endif +/* + * This structure describes the userspace-driven timer. Such timers are purely virtual, + * and can only be triggered from software (for instance, by a userspace application).
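+ * A timer of this kind is created with the SNDRV_TIMER_IOCTL_CREATE ioctl,
+ * which returns a file descriptor for it in the fd field, and it is fired
+ * by issuing SNDRV_TIMER_IOCTL_TRIGGER on that file descriptor.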
+ */ +struct snd_timer_uinfo { + /* To pretend being a normal timer, we need to know the resolution in ns. */ + __u64 resolution; + int fd; + unsigned int id; + unsigned char reserved[16]; +}; + #define SNDRV_TIMER_IOCTL_PVERSION _IOR('T', 0x00, int) #define SNDRV_TIMER_IOCTL_NEXT_DEVICE _IOWR('T', 0x01, struct snd_timer_id) #define SNDRV_TIMER_IOCTL_TREAD_OLD _IOW('T', 0x02, int) @@ -990,6 +1003,8 @@ struct snd_timer_status { #define SNDRV_TIMER_IOCTL_CONTINUE _IO('T', 0xa2) #define SNDRV_TIMER_IOCTL_PAUSE _IO('T', 0xa3) #define SNDRV_TIMER_IOCTL_TREAD64 _IOW('T', 0xa4, int) +#define SNDRV_TIMER_IOCTL_CREATE _IOWR('T', 0xa5, struct snd_timer_uinfo) +#define SNDRV_TIMER_IOCTL_TRIGGER _IO('T', 0xa6) #if __BITS_PER_LONG == 64 #define SNDRV_TIMER_IOCTL_TREAD SNDRV_TIMER_IOCTL_TREAD_OLD diff --git a/sound/core/Kconfig b/sound/core/Kconfig index f0ba87d4e504..2c5b9f964703 100644 --- a/sound/core/Kconfig +++ b/sound/core/Kconfig @@ -242,6 +242,16 @@ config SND_JACK_INJECTION_DEBUG Say Y if you are debugging via jack injection interface. If unsure select "N". +config SND_UTIMER + bool "Enable support for userspace-controlled virtual timers" + depends on SND_TIMER + help + Say Y to enable the support of userspace-controlled timers. These + timers are purely virtual, and they are supposed to be triggered + from userspace. They could be quite useful when synchronizing the + sound timing with userspace applications (for instance, when sending + data through snd-aloop). + config SND_VMASTER bool diff --git a/sound/core/timer.c b/sound/core/timer.c index 71a07c1662f5..7610f102360d 100644 --- a/sound/core/timer.c +++ b/sound/core/timer.c @@ -13,6 +13,8 @@ #include #include #include +#include +#include #include #include #include @@ -109,6 +111,16 @@ struct snd_timer_status64 { unsigned char reserved[64]; /* reserved */ }; +#ifdef CONFIG_SND_UTIMER +#define SNDRV_UTIMERS_MAX_COUNT 128 +/* Internal data structure for keeping the state of the userspace-driven timer */ +struct snd_utimer { + char *name; + struct snd_timer *timer; + unsigned int id; +}; +#endif + #define SNDRV_TIMER_IOCTL_STATUS64 _IOR('T', 0x14, struct snd_timer_status64) /* list of timers */ @@ -2009,6 +2021,217 @@ enum { SNDRV_TIMER_IOCTL_PAUSE_OLD = _IO('T', 0x23), }; +#ifdef CONFIG_SND_UTIMER +/* + * Since userspace-driven timers are passed to userspace, we need to have an identifier + * which will allow us to use them (basically, the subdevice number of udriven timer). 
+ */ +static DEFINE_IDA(snd_utimer_ids); + +static void snd_utimer_put_id(struct snd_utimer *utimer) +{ + int timer_id = utimer->id; + + snd_BUG_ON(timer_id < 0 || timer_id >= SNDRV_UTIMERS_MAX_COUNT); + ida_free(&snd_utimer_ids, timer_id); +} + +static int snd_utimer_take_id(void) +{ + return ida_alloc_max(&snd_utimer_ids, SNDRV_UTIMERS_MAX_COUNT - 1, GFP_KERNEL); +} + +static void snd_utimer_free(struct snd_utimer *utimer) +{ + snd_timer_free(utimer->timer); + snd_utimer_put_id(utimer); + kfree(utimer->name); + kfree(utimer); +} + +static int snd_utimer_release(struct inode *inode, struct file *file) +{ + struct snd_utimer *utimer = (struct snd_utimer *)file->private_data; + + snd_utimer_free(utimer); + return 0; +} + +static int snd_utimer_trigger(struct file *file) +{ + struct snd_utimer *utimer = (struct snd_utimer *)file->private_data; + + snd_timer_interrupt(utimer->timer, utimer->timer->sticks); + return 0; +} + +static long snd_utimer_ioctl(struct file *file, unsigned int ioctl, unsigned long arg) +{ + switch (ioctl) { + case SNDRV_TIMER_IOCTL_TRIGGER: + return snd_utimer_trigger(file); + } + + return -ENOTTY; +} + +static const struct file_operations snd_utimer_fops = { + .llseek = noop_llseek, + .release = snd_utimer_release, + .unlocked_ioctl = snd_utimer_ioctl, +}; + +static int snd_utimer_start(struct snd_timer *t) +{ + return 0; +} + +static int snd_utimer_stop(struct snd_timer *t) +{ + return 0; +} + +static int snd_utimer_open(struct snd_timer *t) +{ + return 0; +} + +static int snd_utimer_close(struct snd_timer *t) +{ + return 0; +} + +static const struct snd_timer_hardware timer_hw = { + .flags = SNDRV_TIMER_HW_AUTO | SNDRV_TIMER_HW_WORK, + .open = snd_utimer_open, + .close = snd_utimer_close, + .start = snd_utimer_start, + .stop = snd_utimer_stop, +}; + +static int snd_utimer_create(struct snd_timer_uinfo *utimer_info, + struct snd_utimer **r_utimer) +{ + struct snd_utimer *utimer; + struct snd_timer *timer; + struct snd_timer_id tid; + int utimer_id; + int err = 0; + + if (!utimer_info || utimer_info->resolution == 0) + return -EINVAL; + + utimer = kzalloc(sizeof(*utimer), GFP_KERNEL); + if (!utimer) + return -ENOMEM; + + /* We hold the ioctl lock here so we won't get a race condition when allocating id */ + utimer_id = snd_utimer_take_id(); + if (utimer_id < 0) { + err = utimer_id; + goto err_take_id; + } + + utimer->name = kasprintf(GFP_KERNEL, "snd-utimer%d", utimer_id); + if (!utimer->name) { + err = -ENOMEM; + goto err_get_name; + } + + utimer->id = utimer_id; + + tid.dev_sclass = SNDRV_TIMER_SCLASS_APPLICATION; + tid.dev_class = SNDRV_TIMER_CLASS_GLOBAL; + tid.card = -1; + tid.device = SNDRV_TIMER_GLOBAL_UDRIVEN; + tid.subdevice = utimer_id; + + err = snd_timer_new(NULL, utimer->name, &tid, &timer); + if (err < 0) { + pr_err("Can't create userspace-driven timer\n"); + goto err_timer_new; + } + + timer->module = THIS_MODULE; + timer->hw = timer_hw; + timer->hw.resolution = utimer_info->resolution; + timer->hw.ticks = 1; + timer->max_instances = MAX_SLAVE_INSTANCES; + + utimer->timer = timer; + + err = snd_timer_global_register(timer); + if (err < 0) { + pr_err("Can't register a userspace-driven timer\n"); + goto err_timer_reg; + } + + *r_utimer = utimer; + return 0; + +err_timer_reg: + snd_timer_free(timer); +err_timer_new: + kfree(utimer->name); +err_get_name: + snd_utimer_put_id(utimer); +err_take_id: + kfree(utimer); + + return err; +} + +static int snd_utimer_ioctl_create(struct file *file, + struct snd_timer_uinfo __user *_utimer_info) +{ + struct snd_utimer 
*utimer; + struct snd_timer_uinfo *utimer_info __free(kfree) = NULL; + int err, timer_fd; + + utimer_info = memdup_user(_utimer_info, sizeof(*utimer_info)); + if (IS_ERR(utimer_info)) + return PTR_ERR(no_free_ptr(utimer_info)); + + err = snd_utimer_create(utimer_info, &utimer); + if (err < 0) + return err; + + utimer_info->id = utimer->id; + + timer_fd = anon_inode_getfd(utimer->name, &snd_utimer_fops, utimer, O_RDWR | O_CLOEXEC); + if (timer_fd < 0) { + snd_utimer_free(utimer); + return timer_fd; + } + + utimer_info->fd = timer_fd; + + err = copy_to_user(_utimer_info, utimer_info, sizeof(*utimer_info)); + if (err) { + /* + * "Leak" the fd, as there is nothing we can do about it. + * It might have been closed already since anon_inode_getfd + * makes it available for userspace. + * + * We have to rely on the process exit path to do any + * necessary cleanup (e.g. releasing the file). + */ + return -EFAULT; + } + + return 0; +} + +#else + +static int snd_utimer_ioctl_create(struct file *file, + struct snd_timer_uinfo __user *_utimer_info) +{ + return -ENOTTY; +} + +#endif + static long __snd_timer_user_ioctl(struct file *file, unsigned int cmd, unsigned long arg, bool compat) { @@ -2053,6 +2276,8 @@ static long __snd_timer_user_ioctl(struct file *file, unsigned int cmd, case SNDRV_TIMER_IOCTL_PAUSE: case SNDRV_TIMER_IOCTL_PAUSE_OLD: return snd_timer_user_pause(file); + case SNDRV_TIMER_IOCTL_CREATE: + return snd_utimer_ioctl_create(file, argp); } return -ENOTTY; } -- cgit v1.2.3 From 820a185896b77814557302b981b092a9e7b36814 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Wed, 24 Jul 2024 15:15:35 +0200 Subject: fcntl: add F_CREATED_QUERY Systemd has a helper called openat_report_new() that returns whether a file was created anew or it already existed before for cases where O_CREAT has to be used without O_EXCL (cf. [1]). That apparently isn't something that's specific to systemd but it's where I noticed it. The current logic is that it first attempts to open the file without O_CREAT | O_EXCL and if it gets ENOENT the helper tries again with both flags. If that succeeds all is well. If it now reports EEXIST it retries. That works fairly well but some corner cases make this more involved. If this operates on a dangling symlink the first openat() without O_CREAT | O_EXCL will return ENOENT but the second openat() with O_CREAT | O_EXCL will fail with EEXIST. The reason is that openat() without O_CREAT | O_EXCL follows the symlink while O_CREAT | O_EXCL doesn't for security reasons. So it's not something we can really change unless we add an explicit opt-in via O_FOLLOW which seems really ugly. The caller could try and use fanotify() to register to listen for creation events in the directory before calling openat(). The caller could then compare the returned tid to its own tid to ensure that even in threaded environments it actually created the file. That might work but is a lot of work for something that should be fairly simple and I'm uncertain about its reliability. The caller could use a bpf lsm hook to hook into security_file_open() to figure out whether they created the file. That also seems a bit wild. So let's add F_CREATED_QUERY which allows the caller to check whether they actually did create the file. That has caveats of course but I don't think they are problematic: * In multi-threaded environments a thread can only be sure that it did create the file if it calls openat() with O_CREAT.
In other words, it's obviously not enough to just go through its fdtable and check these fds because another thread could've created the file. * If there are any codepaths where an openat() with O_CREAT would yield the same struct file as that of another thread it would obviously cause wrong results. I'm not aware of any such codepaths from openat() itself. Imho, that would be a bug. * Related to the previous point, calling the new fcntl() on files created and opened via special-purpose system calls or ioctl()s would cause wrong results only if the affected subsystem a) raises FMODE_CREATED and b) may return the same struct file for two different calls. I'm not seeing anything outside of regular VFS code that raises FMODE_CREATED. There is code for b) in e.g., the drm layer where the same struct file is resurfaced but again FMODE_CREATED isn't used and it would be very misleading if it did. Link: https://github.com/systemd/systemd/blob/11d5e2b5fbf9f6bfa5763fd45b56829ad4f0777f/src/basic/fs-util.c#L1078 [1] Link: https://lore.kernel.org/r/20240724-work-fcntl-v1-1-e8153a2f1991@kernel.org Reviewed-by: Jeff Layton Reviewed-by: Jan Kara Signed-off-by: Christian Brauner --- fs/fcntl.c | 10 ++++++++++ include/uapi/linux/fcntl.h | 3 +++ 2 files changed, 13 insertions(+) (limited to 'include/uapi') diff --git a/fs/fcntl.c b/fs/fcntl.c index 300e5d9ad913..22ec683ad8f8 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -343,6 +343,12 @@ static long f_dupfd_query(int fd, struct file *filp) return f.file == filp; } +/* Let the caller figure out whether a given file was just created. */ +static long f_created_query(const struct file *filp) +{ + return !!(filp->f_mode & FMODE_CREATED); +} + static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, struct file *filp) { @@ -352,6 +358,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, long err = -EINVAL; switch (cmd) { + case F_CREATED_QUERY: + err = f_created_query(filp); + break; case F_DUPFD: err = f_dupfd(argi, filp, 0); break; @@ -463,6 +472,7 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, static int check_fcntl_cmd(unsigned cmd) { switch (cmd) { + case F_CREATED_QUERY: case F_DUPFD: case F_DUPFD_CLOEXEC: case F_DUPFD_QUERY: diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index c0bcc185fa48..e55a3314bcb0 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -16,6 +16,9 @@ #define F_DUPFD_QUERY (F_LINUX_SPECIFIC_BASE + 3) +/* Was the file just created? */ +#define F_CREATED_QUERY (F_LINUX_SPECIFIC_BASE + 4) + /* * Cancel a blocking posix lock; internal use only until we expose an * asynchronous lock api to userspace: -- cgit v1.2.3 From 0311507792b54069ac72e0a6c6b35c5d40aadad8 Mon Sep 17 00:00:00 2001 From: Deven Bowers Date: Fri, 2 Aug 2024 23:08:15 -0700 Subject: lsm: add IPE lsm Integrity Policy Enforcement (IPE) is an LSM that provides an approach to Mandatory Access Control complementary to that of existing LSMs today. Existing LSMs have centered around the concept that access to a resource should be controlled by the current user's credentials. IPE's approach is that access to a resource should be controlled by the system's trust of the resource. The basis of this approach is defining a global policy to specify which resource can be trusted.
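To make the model concrete, a minimal policy in this spirit could look roughly like the following sketch, pieced together from the rule strings that appear in the audit examples later in this series (treat it as illustrative only; the policy grammar itself is introduced by later IPE patches):

	policy_name=boot_verified policy_version=0.0.0
	DEFAULT action=DENY
	op=EXECUTE boot_verified=TRUE action=ALLOW

Anything not matched by a more specific rule falls through to the DEFAULT action.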
Signed-off-by: Deven Bowers Signed-off-by: Fan Wu [PM: subject line tweak] Signed-off-by: Paul Moore --- include/uapi/linux/lsm.h | 1 + security/Kconfig | 11 +++--- security/Makefile | 1 + security/ipe/Kconfig | 17 +++++++++ security/ipe/Makefile | 9 +++++ security/ipe/ipe.c | 42 ++++++++++++++++++++++ security/ipe/ipe.h | 16 +++++++++ security/security.c | 3 +- .../testing/selftests/lsm/lsm_list_modules_test.c | 3 ++ 9 files changed, 97 insertions(+), 6 deletions(-) create mode 100644 security/ipe/Kconfig create mode 100644 security/ipe/Makefile create mode 100644 security/ipe/ipe.c create mode 100644 security/ipe/ipe.h (limited to 'include/uapi') diff --git a/include/uapi/linux/lsm.h b/include/uapi/linux/lsm.h index 33d8c9f4aa6b..938593dfd5da 100644 --- a/include/uapi/linux/lsm.h +++ b/include/uapi/linux/lsm.h @@ -64,6 +64,7 @@ struct lsm_ctx { #define LSM_ID_LANDLOCK 110 #define LSM_ID_IMA 111 #define LSM_ID_EVM 112 +#define LSM_ID_IPE 113 /* * LSM_ATTR_XXX definitions identify different LSM attributes diff --git a/security/Kconfig b/security/Kconfig index 412e76f1575d..9fb8f9b14972 100644 --- a/security/Kconfig +++ b/security/Kconfig @@ -192,6 +192,7 @@ source "security/yama/Kconfig" source "security/safesetid/Kconfig" source "security/lockdown/Kconfig" source "security/landlock/Kconfig" +source "security/ipe/Kconfig" source "security/integrity/Kconfig" @@ -231,11 +232,11 @@ endchoice config LSM string "Ordered list of enabled LSMs" - default "landlock,lockdown,yama,loadpin,safesetid,smack,selinux,tomoyo,apparmor,bpf" if DEFAULT_SECURITY_SMACK - default "landlock,lockdown,yama,loadpin,safesetid,apparmor,selinux,smack,tomoyo,bpf" if DEFAULT_SECURITY_APPARMOR - default "landlock,lockdown,yama,loadpin,safesetid,tomoyo,bpf" if DEFAULT_SECURITY_TOMOYO - default "landlock,lockdown,yama,loadpin,safesetid,bpf" if DEFAULT_SECURITY_DAC - default "landlock,lockdown,yama,loadpin,safesetid,selinux,smack,tomoyo,apparmor,bpf" + default "landlock,lockdown,yama,loadpin,safesetid,smack,selinux,tomoyo,apparmor,ipe,bpf" if DEFAULT_SECURITY_SMACK + default "landlock,lockdown,yama,loadpin,safesetid,apparmor,selinux,smack,tomoyo,ipe,bpf" if DEFAULT_SECURITY_APPARMOR + default "landlock,lockdown,yama,loadpin,safesetid,tomoyo,ipe,bpf" if DEFAULT_SECURITY_TOMOYO + default "landlock,lockdown,yama,loadpin,safesetid,ipe,bpf" if DEFAULT_SECURITY_DAC + default "landlock,lockdown,yama,loadpin,safesetid,selinux,smack,tomoyo,apparmor,ipe,bpf" help A comma-separated list of LSMs, in initialization order. 
Any LSMs left off this list, except for those with order diff --git a/security/Makefile b/security/Makefile index 59f238490665..cc0982214b84 100644 --- a/security/Makefile +++ b/security/Makefile @@ -25,6 +25,7 @@ obj-$(CONFIG_SECURITY_LOCKDOWN_LSM) += lockdown/ obj-$(CONFIG_CGROUPS) += device_cgroup.o obj-$(CONFIG_BPF_LSM) += bpf/ obj-$(CONFIG_SECURITY_LANDLOCK) += landlock/ +obj-$(CONFIG_SECURITY_IPE) += ipe/ # Object integrity file lists obj-$(CONFIG_INTEGRITY) += integrity/ diff --git a/security/ipe/Kconfig b/security/ipe/Kconfig new file mode 100644 index 000000000000..e4875fb04883 --- /dev/null +++ b/security/ipe/Kconfig @@ -0,0 +1,17 @@ +# SPDX-License-Identifier: GPL-2.0-only +# +# Integrity Policy Enforcement (IPE) configuration +# + +menuconfig SECURITY_IPE + bool "Integrity Policy Enforcement (IPE)" + depends on SECURITY && SECURITYFS + select PKCS7_MESSAGE_PARSER + select SYSTEM_DATA_VERIFICATION + help + This option enables the Integrity Policy Enforcement LSM + allowing users to define a policy to enforce a trust-based access + control. A key feature of IPE is a customizable policy to allow + admins to reconfigure trust requirements on the fly. + + If unsure, answer N. diff --git a/security/ipe/Makefile b/security/ipe/Makefile new file mode 100644 index 000000000000..5486398a69e9 --- /dev/null +++ b/security/ipe/Makefile @@ -0,0 +1,9 @@ +# SPDX-License-Identifier: GPL-2.0 +# +# Copyright (C) 2020-2024 Microsoft Corporation. All rights reserved. +# +# Makefile for building the IPE module as part of the kernel tree. +# + +obj-$(CONFIG_SECURITY_IPE) += \ + ipe.o \ diff --git a/security/ipe/ipe.c b/security/ipe/ipe.c new file mode 100644 index 000000000000..8d4ea372873e --- /dev/null +++ b/security/ipe/ipe.c @@ -0,0 +1,42 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2020-2024 Microsoft Corporation. All rights reserved. + */ +#include + +#include "ipe.h" + +static struct lsm_blob_sizes ipe_blobs __ro_after_init = { +}; + +static const struct lsm_id ipe_lsmid = { + .name = "ipe", + .id = LSM_ID_IPE, +}; + +static struct security_hook_list ipe_hooks[] __ro_after_init = { +}; + +/** + * ipe_init() - Entry point of IPE. + * + * This is called at LSM init, which occurs early during kernel + * start up. During this phase, IPE registers its hooks and loads the + * builtin boot policy. + * + * Return: + * * %0 - OK + * * %-ENOMEM - Out of memory (OOM) + */ +static int __init ipe_init(void) +{ + security_add_hooks(ipe_hooks, ARRAY_SIZE(ipe_hooks), &ipe_lsmid); + + return 0; +} + +DEFINE_LSM(ipe) = { + .name = "ipe", + .init = ipe_init, + .blobs = &ipe_blobs, +}; diff --git a/security/ipe/ipe.h b/security/ipe/ipe.h new file mode 100644 index 000000000000..adc3c45e9f53 --- /dev/null +++ b/security/ipe/ipe.h @@ -0,0 +1,16 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2020-2024 Microsoft Corporation. All rights reserved. + */ + +#ifndef _IPE_H +#define _IPE_H + +#ifdef pr_fmt +#undef pr_fmt +#endif +#define pr_fmt(fmt) "ipe: " fmt + +#include + +#endif /* _IPE_H */ diff --git a/security/security.c b/security/security.c index 611d3c124ba6..645a660320cb 100644 --- a/security/security.c +++ b/security/security.c @@ -53,7 +53,8 @@ (IS_ENABLED(CONFIG_BPF_LSM) ? 1 : 0) + \ (IS_ENABLED(CONFIG_SECURITY_LANDLOCK) ? 1 : 0) + \ (IS_ENABLED(CONFIG_IMA) ? 1 : 0) + \ - (IS_ENABLED(CONFIG_EVM) ? 1 : 0)) + (IS_ENABLED(CONFIG_EVM) ? 1 : 0) + \ + (IS_ENABLED(CONFIG_SECURITY_IPE) ?
1 : 0)) /* * These are descriptions of the reasons that can be passed to the diff --git a/tools/testing/selftests/lsm/lsm_list_modules_test.c b/tools/testing/selftests/lsm/lsm_list_modules_test.c index 06d24d4679a6..1cc8a977c711 100644 --- a/tools/testing/selftests/lsm/lsm_list_modules_test.c +++ b/tools/testing/selftests/lsm/lsm_list_modules_test.c @@ -128,6 +128,9 @@ TEST(correct_lsm_list_modules) case LSM_ID_EVM: name = "evm"; break; + case LSM_ID_IPE: + name = "ipe"; + break; default: name = "INVALID"; break; -- cgit v1.2.3 From d386d59b7c1a39112ca875327339ed519df2b96c Mon Sep 17 00:00:00 2001 From: Wen Gu Date: Wed, 14 Aug 2024 21:08:26 +0800 Subject: net/smc: introduce statistics for allocated ringbufs of link group Currently we have the statistics on sndbuf/RMB sizes of all connections that have ever been on the link group, namely smc_stats_memsize. However, these statistics are incremental and, since the ringbufs of a link group are allowed to be reused, we cannot know the actually allocated buffers through these. So this patch introduces a statistic on the actually allocated ringbufs of the link group; it is incremented when a new ringbuf is added into buf_list and decremented when one is deleted from buf_list. Signed-off-by: Wen Gu Signed-off-by: Paolo Abeni --- include/uapi/linux/smc.h | 4 ++++ net/smc/smc_core.c | 46 ++++++++++++++++++++++++++++++++++++++++++---- net/smc/smc_core.h | 2 ++ 3 files changed, 48 insertions(+), 4 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/smc.h b/include/uapi/linux/smc.h index b531e3ef011a..9b74ef79070a 100644 --- a/include/uapi/linux/smc.h +++ b/include/uapi/linux/smc.h @@ -127,6 +127,8 @@ enum { SMC_NLA_LGR_R_NET_COOKIE, /* u64 */ SMC_NLA_LGR_R_PAD, /* flag */ SMC_NLA_LGR_R_BUF_TYPE, /* u8 */ + SMC_NLA_LGR_R_SNDBUF_ALLOC, /* uint */ + SMC_NLA_LGR_R_RMB_ALLOC, /* uint */ __SMC_NLA_LGR_R_MAX, SMC_NLA_LGR_R_MAX = __SMC_NLA_LGR_R_MAX - 1 }; @@ -162,6 +164,8 @@ enum { SMC_NLA_LGR_D_V2_COMMON, /* nest */ SMC_NLA_LGR_D_EXT_GID, /* u64 */ SMC_NLA_LGR_D_PEER_EXT_GID, /* u64 */ + SMC_NLA_LGR_D_SNDBUF_ALLOC, /* uint */ + SMC_NLA_LGR_D_DMB_ALLOC, /* uint */ __SMC_NLA_LGR_D_MAX, SMC_NLA_LGR_D_MAX = __SMC_NLA_LGR_D_MAX - 1 }; diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c index 71fb334d8234..8dcf1c7f1526 100644 --- a/net/smc/smc_core.c +++ b/net/smc/smc_core.c @@ -221,6 +221,35 @@ static void smc_lgr_unregister_conn(struct smc_connection *conn) write_unlock_bh(&lgr->conns_lock); } +static void smc_lgr_buf_list_add(struct smc_link_group *lgr, + bool is_rmb, + struct list_head *buf_list, + struct smc_buf_desc *buf_desc) +{ + list_add(&buf_desc->list, buf_list); + if (is_rmb) { + lgr->alloc_rmbs += buf_desc->len; + lgr->alloc_rmbs += + lgr->is_smcd ? sizeof(struct smcd_cdc_msg) : 0; + } else { + lgr->alloc_sndbufs += buf_desc->len; + } +} + +static void smc_lgr_buf_list_del(struct smc_link_group *lgr, + bool is_rmb, + struct smc_buf_desc *buf_desc) +{ + list_del(&buf_desc->list); + if (is_rmb) { + lgr->alloc_rmbs -= buf_desc->len; + lgr->alloc_rmbs -= + lgr->is_smcd ?
sizeof(struct smcd_cdc_msg) : 0; + } else { + lgr->alloc_sndbufs -= buf_desc->len; + } +} + int smc_nl_get_sys_info(struct sk_buff *skb, struct netlink_callback *cb) { struct smc_nl_dmp_ctx *cb_ctx = smc_nl_dmp_ctx(cb); @@ -363,6 +392,10 @@ static int smc_nl_fill_lgr(struct smc_link_group *lgr, smc_target[SMC_MAX_PNETID_LEN] = 0; if (nla_put_string(skb, SMC_NLA_LGR_R_PNETID, smc_target)) goto errattr; + if (nla_put_uint(skb, SMC_NLA_LGR_R_SNDBUF_ALLOC, lgr->alloc_sndbufs)) + goto errattr; + if (nla_put_uint(skb, SMC_NLA_LGR_R_RMB_ALLOC, lgr->alloc_rmbs)) + goto errattr; if (lgr->smc_version > SMC_V1) { v2_attrs = nla_nest_start(skb, SMC_NLA_LGR_R_V2_COMMON); if (!v2_attrs) @@ -541,6 +574,10 @@ static int smc_nl_fill_smcd_lgr(struct smc_link_group *lgr, goto errattr; if (nla_put_u32(skb, SMC_NLA_LGR_D_CHID, smc_ism_get_chid(lgr->smcd))) goto errattr; + if (nla_put_uint(skb, SMC_NLA_LGR_D_SNDBUF_ALLOC, lgr->alloc_sndbufs)) + goto errattr; + if (nla_put_uint(skb, SMC_NLA_LGR_D_DMB_ALLOC, lgr->alloc_rmbs)) + goto errattr; memcpy(smc_pnet, lgr->smcd->pnetid, SMC_MAX_PNETID_LEN); smc_pnet[SMC_MAX_PNETID_LEN] = 0; if (nla_put_string(skb, SMC_NLA_LGR_D_PNETID, smc_pnet)) @@ -1138,7 +1175,7 @@ static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb, lock = is_rmb ? &lgr->rmbs_lock : &lgr->sndbufs_lock; down_write(lock); - list_del(&buf_desc->list); + smc_lgr_buf_list_del(lgr, is_rmb, buf_desc); up_write(lock); smc_buf_free(lgr, is_rmb, buf_desc); @@ -1377,7 +1414,7 @@ static void __smc_lgr_free_bufs(struct smc_link_group *lgr, bool is_rmb) buf_list = &lgr->sndbufs[i]; list_for_each_entry_safe(buf_desc, bf_desc, buf_list, list) { - list_del(&buf_desc->list); + smc_lgr_buf_list_del(lgr, is_rmb, buf_desc); smc_buf_free(lgr, is_rmb, buf_desc); } } @@ -2414,7 +2451,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb) SMC_STAT_RMB_SIZE(smc, is_smcd, is_rmb, bufsize); buf_desc->used = 1; down_write(lock); - list_add(&buf_desc->list, buf_list); + smc_lgr_buf_list_add(lgr, is_rmb, buf_list, buf_desc); up_write(lock); break; /* found */ } @@ -2496,7 +2533,8 @@ create_rmb: rc = __smc_buf_create(smc, is_smcd, true); if (rc && smc->conn.sndbuf_desc) { down_write(&smc->conn.lgr->sndbufs_lock); - list_del(&smc->conn.sndbuf_desc->list); + smc_lgr_buf_list_del(smc->conn.lgr, false, + smc->conn.sndbuf_desc); up_write(&smc->conn.lgr->sndbufs_lock); smc_buf_free(smc->conn.lgr, false, smc->conn.sndbuf_desc); smc->conn.sndbuf_desc = NULL; diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h index d93cf51dbd7c..0db4e5f79ac4 100644 --- a/net/smc/smc_core.h +++ b/net/smc/smc_core.h @@ -281,6 +281,8 @@ struct smc_link_group { struct rw_semaphore sndbufs_lock; /* protects tx buffers */ struct list_head rmbs[SMC_RMBE_SIZES]; /* rx buffers */ struct rw_semaphore rmbs_lock; /* protects rx buffers */ + u64 alloc_sndbufs; /* stats of tx buffers */ + u64 alloc_rmbs; /* stats of rx buffers */ u8 id[SMC_LGR_ID_SIZE]; /* unique lgr id */ struct delayed_work free_work; /* delayed freeing of an lgr */ -- cgit v1.2.3 From e0d103542b06c36240e3887edfe49578464866eb Mon Sep 17 00:00:00 2001 From: Wen Gu Date: Wed, 14 Aug 2024 21:08:27 +0800 Subject: net/smc: introduce statistics for ringbufs usage of net namespace The buffer size histograms in smc_stats, namely rx/tx_rmbsize, record the sizes of ringbufs for all connections that have ever appeared in the net namespace. They are incremental and we cannot know the actual ringbufs usage from these. 
So this patch introduces statistics for the current ringbuf usage of existing SMC connections in the net namespace into smc_stats; they are incremented when a new connection uses a ringbuf and decremented when the ringbuf is unused. Signed-off-by: Wen Gu Signed-off-by: Paolo Abeni --- include/uapi/linux/smc.h | 2 ++ net/smc/smc_core.c | 22 +++++++++++++++------- net/smc/smc_stats.c | 6 ++++++ net/smc/smc_stats.h | 28 +++++++++++++++++++--------- 4 files changed, 42 insertions(+), 16 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/smc.h b/include/uapi/linux/smc.h index 9b74ef79070a..1f58cb0c266b 100644 --- a/include/uapi/linux/smc.h +++ b/include/uapi/linux/smc.h @@ -253,6 +253,8 @@ enum { SMC_NLA_STATS_T_TX_BYTES, /* u64 */ SMC_NLA_STATS_T_RX_CNT, /* u64 */ SMC_NLA_STATS_T_TX_CNT, /* u64 */ + SMC_NLA_STATS_T_RX_RMB_USAGE, /* uint */ + SMC_NLA_STATS_T_TX_RMB_USAGE, /* uint */ __SMC_NLA_STATS_T_MAX, SMC_NLA_STATS_T_MAX = __SMC_NLA_STATS_T_MAX - 1 }; diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c index 8dcf1c7f1526..4e694860ece4 100644 --- a/net/smc/smc_core.c +++ b/net/smc/smc_core.c @@ -1203,22 +1203,30 @@ static void smcd_buf_detach(struct smc_connection *conn) static void smc_buf_unuse(struct smc_connection *conn, struct smc_link_group *lgr) { + struct smc_sock *smc = container_of(conn, struct smc_sock, conn); + bool is_smcd = lgr->is_smcd; + int bufsize; + if (conn->sndbuf_desc) { - if (!lgr->is_smcd && conn->sndbuf_desc->is_vm) { + bufsize = conn->sndbuf_desc->len; + if (!is_smcd && conn->sndbuf_desc->is_vm) { smcr_buf_unuse(conn->sndbuf_desc, false, lgr); } else { - memzero_explicit(conn->sndbuf_desc->cpu_addr, conn->sndbuf_desc->len); + memzero_explicit(conn->sndbuf_desc->cpu_addr, bufsize); WRITE_ONCE(conn->sndbuf_desc->used, 0); } + SMC_STAT_RMB_SIZE(smc, is_smcd, false, false, bufsize); } if (conn->rmb_desc) { - if (!lgr->is_smcd) { + bufsize = conn->rmb_desc->len; + if (!is_smcd) { smcr_buf_unuse(conn->rmb_desc, true, lgr); } else { - memzero_explicit(conn->rmb_desc->cpu_addr, - conn->rmb_desc->len + sizeof(struct smcd_cdc_msg)); + bufsize += sizeof(struct smcd_cdc_msg); + memzero_explicit(conn->rmb_desc->cpu_addr, bufsize); WRITE_ONCE(conn->rmb_desc->used, 0); } + SMC_STAT_RMB_SIZE(smc, is_smcd, true, false, bufsize); } } @@ -2427,7 +2435,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb) buf_desc = smc_buf_get_slot(bufsize_comp, lock, buf_list); if (buf_desc) { buf_desc->is_dma_need_sync = 0; - SMC_STAT_RMB_SIZE(smc, is_smcd, is_rmb, bufsize); + SMC_STAT_RMB_SIZE(smc, is_smcd, is_rmb, true, bufsize); SMC_STAT_BUF_REUSE(smc, is_smcd, is_rmb); break; /* found reusable slot */ } @@ -2448,7 +2456,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb) } SMC_STAT_RMB_ALLOC(smc, is_smcd, is_rmb); - SMC_STAT_RMB_SIZE(smc, is_smcd, is_rmb, bufsize); + SMC_STAT_RMB_SIZE(smc, is_smcd, is_rmb, true, bufsize); buf_desc->used = 1; down_write(lock); smc_lgr_buf_list_add(lgr, is_rmb, buf_list, buf_desc); diff --git a/net/smc/smc_stats.c b/net/smc/smc_stats.c index ca14c0f3a07d..e71b17d1e21c 100644 --- a/net/smc/smc_stats.c +++ b/net/smc/smc_stats.c @@ -218,6 +218,12 @@ static int smc_nl_fill_stats_tech_data(struct sk_buff *skb, smc_tech->tx_bytes, SMC_NLA_STATS_PAD)) goto errattr; + if (nla_put_uint(skb, SMC_NLA_STATS_T_RX_RMB_USAGE, + smc_tech->rx_rmbuse)) + goto errattr; + if (nla_put_uint(skb, SMC_NLA_STATS_T_TX_RMB_USAGE, + smc_tech->tx_rmbuse)) + goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_T_RX_CNT,
smc_tech->rx_cnt, SMC_NLA_STATS_PAD)) diff --git a/net/smc/smc_stats.h b/net/smc/smc_stats.h index e19177ce4092..571f9d9e7814 100644 --- a/net/smc/smc_stats.h +++ b/net/smc/smc_stats.h @@ -79,6 +79,8 @@ struct smc_stats_tech { u64 tx_bytes; u64 rx_cnt; u64 tx_cnt; + u64 rx_rmbuse; + u64 tx_rmbuse; }; struct smc_stats { @@ -135,38 +137,46 @@ do { \ } \ while (0) -#define SMC_STAT_RMB_SIZE_SUB(_smc_stats, _tech, k, _len) \ +#define SMC_STAT_RMB_SIZE_SUB(_smc_stats, _tech, k, _is_add, _len) \ do { \ + typeof(_smc_stats) stats = (_smc_stats); \ + typeof(_is_add) is_a = (_is_add); \ typeof(_len) _l = (_len); \ typeof(_tech) t = (_tech); \ int _pos; \ int m = SMC_BUF_MAX - 1; \ if (_l <= 0) \ break; \ - _pos = fls((_l - 1) >> 13); \ - _pos = (_pos <= m) ? _pos : m; \ - this_cpu_inc((*(_smc_stats)).smc[t].k ## _rmbsize.buf[_pos]); \ + if (is_a) { \ + _pos = fls((_l - 1) >> 13); \ + _pos = (_pos <= m) ? _pos : m; \ + this_cpu_inc((*stats).smc[t].k ## _rmbsize.buf[_pos]); \ + this_cpu_add((*stats).smc[t].k ## _rmbuse, _l); \ + } else { \ + this_cpu_sub((*stats).smc[t].k ## _rmbuse, _l); \ + } \ } \ while (0) #define SMC_STAT_RMB_SUB(_smc_stats, type, t, key) \ this_cpu_inc((*(_smc_stats)).smc[t].rmb ## _ ## key.type ## _cnt) -#define SMC_STAT_RMB_SIZE(_smc, _is_smcd, _is_rx, _len) \ +#define SMC_STAT_RMB_SIZE(_smc, _is_smcd, _is_rx, _is_add, _len) \ do { \ struct net *_net = sock_net(&(_smc)->sk); \ struct smc_stats __percpu *_smc_stats = _net->smc.smc_stats; \ + typeof(_is_add) is_add = (_is_add); \ typeof(_is_smcd) is_d = (_is_smcd); \ typeof(_is_rx) is_r = (_is_rx); \ typeof(_len) l = (_len); \ if ((is_d) && (is_r)) \ - SMC_STAT_RMB_SIZE_SUB(_smc_stats, SMC_TYPE_D, rx, l); \ + SMC_STAT_RMB_SIZE_SUB(_smc_stats, SMC_TYPE_D, rx, is_add, l); \ if ((is_d) && !(is_r)) \ - SMC_STAT_RMB_SIZE_SUB(_smc_stats, SMC_TYPE_D, tx, l); \ + SMC_STAT_RMB_SIZE_SUB(_smc_stats, SMC_TYPE_D, tx, is_add, l); \ if (!(is_d) && (is_r)) \ - SMC_STAT_RMB_SIZE_SUB(_smc_stats, SMC_TYPE_R, rx, l); \ + SMC_STAT_RMB_SIZE_SUB(_smc_stats, SMC_TYPE_R, rx, is_add, l); \ if (!(is_d) && !(is_r)) \ - SMC_STAT_RMB_SIZE_SUB(_smc_stats, SMC_TYPE_R, tx, l); \ + SMC_STAT_RMB_SIZE_SUB(_smc_stats, SMC_TYPE_R, tx, is_add, l); \ } \ while (0) -- cgit v1.2.3 From 1fa3314c14c6a20d098991a0a6980f9b18b2f930 Mon Sep 17 00:00:00 2001 From: Ido Schimmel Date: Wed, 14 Aug 2024 15:52:24 +0300 Subject: ipv4: Centralize TOS matching The TOS field in the IPv4 flow information structure ('flowi4_tos') is matched by the kernel against the TOS selector in IPv4 rules and routes. The field is initialized differently by different call sites. Some treat it as DSCP (RFC 2474) and initialize all six DSCP bits, some treat it as RFC 1349 TOS and initialize it using RT_TOS() and some treat it as RFC 791 TOS and initialize it using IPTOS_RT_MASK. What is common to all these call sites is that they all initialize the lower three DSCP bits, which fits the TOS definition in the initial IPv4 specification (RFC 791). Therefore, the kernel only allows configuring IPv4 FIB rules that match on the lower three DSCP bits which are always guaranteed to be initialized by all call sites: # ip -4 rule add tos 0x1c table 100 # ip -4 rule add tos 0x3c table 100 Error: Invalid tos. While this works, it is unlikely to be very useful. RFC 791, which initially defined the TOS and IP precedence fields, was updated over twenty-five years ago by RFC 2474, which replaced these fields with a single six-bit DSCP field.
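To make the 0x1c/0x3c example above concrete: a FIB rule's TOS selector may only use bits covered by IPTOS_TOS_MASK. The two macros below are copied from the uAPI headers; the standalone program is only an illustration of the arithmetic (and of the validation it implies), not code from this patch:

	#include <stdio.h>

	#define IPTOS_TOS_MASK	0x1E			/* from <linux/ip.h> */
	#define RT_TOS(tos)	((tos) & IPTOS_TOS_MASK)	/* from <linux/in_route.h> */

	int main(void)
	{
		/* Bits left over outside IPTOS_TOS_MASK make a rule's TOS invalid. */
		printf("0x1c -> leftover 0x%02x\n", 0x1c & ~IPTOS_TOS_MASK); /* 0x00: accepted */
		printf("0x3c -> leftover 0x%02x\n", 0x3c & ~IPTOS_TOS_MASK); /* 0x20: "Invalid tos" */
		return 0;
	}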
Extending FIB rules to match on DSCP can be done by adding a new DSCP selector while maintaining the existing semantics of the TOS selector for applications that rely on that. A prerequisite for allowing FIB rules to match on DSCP is to adjust all the call sites to initialize the high order DSCP bits and remove their masking along the path to the core where the field is matched on. However, making this change alone will result in a behavior change. For example, a forwarded IPv4 packet with a DS field of 0xfc will no longer match a FIB rule that was configured with 'tos 0x1c'. This behavior change can be avoided by masking the upper three DSCP bits in 'flowi4_tos' before comparing it against the TOS selectors in FIB rules and routes. Implement the above by adding a new function that checks whether a given DSCP value matches the one specified in the IPv4 flow information structure and invoke it from the three places that currently match on 'flowi4_tos'. Use RT_TOS() for the masking of 'flowi4_tos' instead of IPTOS_RT_MASK since the latter is not uAPI and we should be able to remove it at some point. Include <linux/ip.h> in <linux/in_route.h> since the former defines IPTOS_TOS_MASK which is used in the definition of RT_TOS() in <linux/in_route.h>. No regressions in FIB tests: # ./fib_tests.sh [...] Tests passed: 218 Tests failed: 0 And FIB rule tests: # ./fib_rule_tests.sh [...] Tests passed: 116 Tests failed: 0 Signed-off-by: Ido Schimmel Signed-off-by: Paolo Abeni --- include/net/ip_fib.h | 6 ++++++ include/uapi/linux/in_route.h | 2 ++ net/ipv4/fib_rules.c | 2 +- net/ipv4/fib_semantics.c | 3 +-- net/ipv4/fib_trie.c | 3 +-- 5 files changed, 11 insertions(+), 5 deletions(-) (limited to 'include/uapi') diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index 72af2f223e59..269ec10f63e4 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -22,6 +22,7 @@ #include #include #include +#include struct fib_config { u8 fc_dst_len; @@ -434,6 +435,11 @@ static inline bool fib4_rules_early_flow_dissect(struct net *net, #endif /* CONFIG_IP_MULTIPLE_TABLES */ +static inline bool fib_dscp_masked_match(dscp_t dscp, const struct flowi4 *fl4) +{ + return dscp == inet_dsfield_to_dscp(RT_TOS(fl4->flowi4_tos)); +} + /* Exported by fib_frontend.c */ extern const struct nla_policy rtm_ipv4_policy[]; void ip_fib_init(void); diff --git a/include/uapi/linux/in_route.h b/include/uapi/linux/in_route.h index 0cc2c23b47f8..10bdd7e7107f 100644 --- a/include/uapi/linux/in_route.h +++ b/include/uapi/linux/in_route.h @@ -2,6 +2,8 @@ #ifndef _LINUX_IN_ROUTE_H #define _LINUX_IN_ROUTE_H +#include <linux/ip.h> + /* IPv4 routing cache flags */ #define RTCF_DEAD RTNH_F_DEAD diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c index 5bdd1c016009..c26776b71e97 100644 --- a/net/ipv4/fib_rules.c +++ b/net/ipv4/fib_rules.c @@ -186,7 +186,7 @@ INDIRECT_CALLABLE_SCOPE int fib4_rule_match(struct fib_rule *rule, ((daddr ^ r->dst) & r->dstmask)) return 0; - if (r->dscp && r->dscp != inet_dsfield_to_dscp(fl4->flowi4_tos)) + if (r->dscp && !fib_dscp_masked_match(r->dscp, fl4)) return 0; if (rule->ip_proto && (rule->ip_proto != fl4->flowi4_proto)) diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index 2b57cd2b96e2..0f70341cb8b5 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -2066,8 +2066,7 @@ static void fib_select_default(const struct flowi4 *flp, struct fib_result *res) if (fa->fa_slen != slen) continue; - if (fa->fa_dscp && - fa->fa_dscp != inet_dsfield_to_dscp(flp->flowi4_tos)) + if (fa->fa_dscp && !fib_dscp_masked_match(fa->fa_dscp, flp))
continue; if (fa->tb_id != tb->tb_id) continue; diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index 8f30e3f00b7f..09e31757e96c 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -1580,8 +1580,7 @@ found: if (index >= (1ul << fa->fa_slen)) continue; } - if (fa->fa_dscp && - inet_dscp_to_dsfield(fa->fa_dscp) != flp->flowi4_tos) + if (fa->fa_dscp && !fib_dscp_masked_match(fa->fa_dscp, flp)) continue; /* Paired with WRITE_ONCE() in fib_release_info() */ if (READ_ONCE(fi->fib_dead)) -- cgit v1.2.3 From f44554b5067b36c14cc91ed811fa1bd58baed34a Mon Sep 17 00:00:00 2001 From: Deven Bowers Date: Fri, 2 Aug 2024 23:08:23 -0700 Subject: audit,ipe: add IPE auditing support Users of IPE require a way to identify when and why an operation fails, allowing them to both respond to violations of policy and be notified of potentially malicious actions on their systems with respect to IPE itself. This patch introduces 3 new audit events. AUDIT_IPE_ACCESS(1420) indicates the result of an IPE policy evaluation of a resource. AUDIT_IPE_CONFIG_CHANGE(1421) indicates the current active IPE policy has been changed to another loaded policy. AUDIT_IPE_POLICY_LOAD(1422) indicates a new IPE policy has been loaded into the kernel. This patch also adds support for success auditing, allowing users to identify why an allow decision was made for a resource. However, it is recommended to use this option with caution, as it is quite noisy. Here are some examples of the new audit record types: AUDIT_IPE_ACCESS(1420): audit: AUDIT1420 ipe_op=EXECUTE ipe_hook=BPRM_CHECK enforcing=1 pid=297 comm="sh" path="/root/vol/bin/hello" dev="tmpfs" ino=3897 rule="op=EXECUTE boot_verified=TRUE action=ALLOW" audit: AUDIT1420 ipe_op=EXECUTE ipe_hook=BPRM_CHECK enforcing=1 pid=299 comm="sh" path="/mnt/ipe/bin/hello" dev="dm-0" ino=2 rule="DEFAULT action=DENY" audit: AUDIT1420 ipe_op=EXECUTE ipe_hook=BPRM_CHECK enforcing=1 pid=300 path="/tmp/tmpdp2h1lub/deny/bin/hello" dev="tmpfs" ino=131 rule="DEFAULT action=DENY" The above three records were generated when the active IPE policy only allows binaries from the initramfs to run. The three identical `hello` binaries were placed at different locations; only the first hello from the rootfs (initramfs) was allowed. Field ipe_op followed by the IPE operation name associated with the log. Field ipe_hook followed by the name of the LSM hook that triggered the IPE event. Field enforcing followed by the enforcement state of IPE. (it will be introduced in the next commit) Field pid followed by the pid of the process that triggered the IPE event. Field comm followed by the command line program name of the process that triggered the IPE event. Field path followed by the file's path name. Field dev followed by the device name as found in /dev where the file is from. Note that for device mappers it will use the name `dm-X` instead of the name in /dev/mapper. For a file in a temp file system, which is not from a device, it will use `tmpfs` for the field. The implementation of this part follows another existing use case, LSM_AUDIT_DATA_INODE in security/lsm_audit.c. Field ino followed by the file's inode number. Field rule followed by the IPE rule that made the access decision. The whole rule must be audited because the decision is based on the combination of all property conditions in the rule. Along with the syscall audit event, users can tell why an access was blocked.
For example: audit: AUDIT1420 ipe_op=EXECUTE ipe_hook=BPRM_CHECK enforcing=1 pid=2138 comm="bash" path="/mnt/ipe/bin/hello" dev="dm-0" ino=2 rule="DEFAULT action=DENY" audit[1956]: SYSCALL arch=c000003e syscall=59 success=no exit=-13 a0=556790138df0 a1=556790135390 a2=5567901338b0 a3=ab2a41a67f4f1f4e items=1 ppid=147 pid=1956 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=4294967295 comm="bash" exe="/usr/bin/bash" key=(null) The above two records show that bash used execve to run "hello" and was blocked by IPE. Note that the IPE records are always prior to a SYSCALL record. AUDIT_IPE_CONFIG_CHANGE(1421): audit: AUDIT1421 old_active_pol_name="Allow_All" old_active_pol_version=0.0.0 old_policy_digest=sha256:E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649 new_active_pol_name="boot_verified" new_active_pol_version=0.0.0 new_policy_digest=sha256:820EEA5B40CA42B51F68962354BA083122A20BB846F auid=4294967295 ses=4294967295 lsm=ipe res=1 The above record shows the active IPE policy switching from `Allow_All` to `boot_verified`, along with the versions and hash digests of the two policies. Note that IPE can only have one policy active at a time; all access decisions are evaluated against the currently active policy. The normal procedure to deploy a policy is to load it into the kernel first, then switch the active policy to it. AUDIT_IPE_POLICY_LOAD(1422): audit: AUDIT1422 policy_name="boot_verified" policy_version=0.0.0 policy_digest=sha256:820EEA5B40CA42B51F68962354BA083122A20BB846F2676 auid=4294967295 ses=4294967295 lsm=ipe res=1 The above record shows that a new policy has been loaded into the kernel, with the policy name, policy version and policy hash. Signed-off-by: Deven Bowers Signed-off-by: Fan Wu [PM: subject line tweak] Signed-off-by: Paul Moore --- include/uapi/linux/audit.h | 3 + security/ipe/Kconfig | 2 +- security/ipe/Makefile | 1 + security/ipe/audit.c | 227 +++++++++++++++++++++++++++++++++++++++++++++ security/ipe/audit.h | 18 ++++ security/ipe/eval.c | 45 ++++++--- security/ipe/eval.h | 13 ++- security/ipe/fs.c | 68 ++++++++++++++ security/ipe/hooks.c | 10 +- security/ipe/hooks.h | 11 +++ security/ipe/policy.c | 4 + 11 files changed, 384 insertions(+), 18 deletions(-) create mode 100644 security/ipe/audit.c create mode 100644 security/ipe/audit.h (limited to 'include/uapi') diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h index d676ed2b246e..75e21a135483 100644 --- a/include/uapi/linux/audit.h +++ b/include/uapi/linux/audit.h @@ -143,6 +143,9 @@ #define AUDIT_MAC_UNLBL_STCDEL 1417 /* NetLabel: del a static label */ #define AUDIT_MAC_CALIPSO_ADD 1418 /* NetLabel: add CALIPSO DOI entry */ #define AUDIT_MAC_CALIPSO_DEL 1419 /* NetLabel: del CALIPSO DOI entry */ +#define AUDIT_IPE_ACCESS 1420 /* IPE denial or grant */ +#define AUDIT_IPE_CONFIG_CHANGE 1421 /* IPE config change */ +#define AUDIT_IPE_POLICY_LOAD 1422 /* IPE policy load */ #define AUDIT_FIRST_KERN_ANOM_MSG 1700 #define AUDIT_LAST_KERN_ANOM_MSG 1799 diff --git a/security/ipe/Kconfig b/security/ipe/Kconfig index e4875fb04883..ac4d558e69d5 100644 --- a/security/ipe/Kconfig +++ b/security/ipe/Kconfig @@ -5,7 +5,7 @@ menuconfig SECURITY_IPE bool "Integrity Policy Enforcement (IPE)" - depends on SECURITY && SECURITYFS + depends on SECURITY && SECURITYFS && AUDIT && AUDITSYSCALL select PKCS7_MESSAGE_PARSER select SYSTEM_DATA_VERIFICATION help diff --git a/security/ipe/Makefile b/security/ipe/Makefile index b97f8c10fe01..62caccba14b4 100644 ---
a/security/ipe/Makefile +++ b/security/ipe/Makefile @@ -13,3 +13,4 @@ obj-$(CONFIG_SECURITY_IPE) += \ policy.o \ policy_fs.o \ policy_parser.o \ + audit.o \ diff --git a/security/ipe/audit.c b/security/ipe/audit.c new file mode 100644 index 000000000000..5a859f88cfb4 --- /dev/null +++ b/security/ipe/audit.c @@ -0,0 +1,227 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2020-2024 Microsoft Corporation. All rights reserved. + */ + +#include +#include +#include +#include + +#include "ipe.h" +#include "eval.h" +#include "hooks.h" +#include "policy.h" +#include "audit.h" + +#define ACTSTR(x) ((x) == IPE_ACTION_ALLOW ? "ALLOW" : "DENY") + +#define IPE_AUDIT_HASH_ALG "sha256" + +#define AUDIT_POLICY_LOAD_FMT "policy_name=\"%s\" policy_version=%hu.%hu.%hu "\ + "policy_digest=" IPE_AUDIT_HASH_ALG ":" +#define AUDIT_OLD_ACTIVE_POLICY_FMT "old_active_pol_name=\"%s\" "\ + "old_active_pol_version=%hu.%hu.%hu "\ + "old_policy_digest=" IPE_AUDIT_HASH_ALG ":" +#define AUDIT_OLD_ACTIVE_POLICY_NULL_FMT "old_active_pol_name=? "\ + "old_active_pol_version=? "\ + "old_policy_digest=?" +#define AUDIT_NEW_ACTIVE_POLICY_FMT "new_active_pol_name=\"%s\" "\ + "new_active_pol_version=%hu.%hu.%hu "\ + "new_policy_digest=" IPE_AUDIT_HASH_ALG ":" + +static const char *const audit_op_names[__IPE_OP_MAX + 1] = { + "EXECUTE", + "FIRMWARE", + "KMODULE", + "KEXEC_IMAGE", + "KEXEC_INITRAMFS", + "POLICY", + "X509_CERT", + "UNKNOWN", +}; + +static const char *const audit_hook_names[__IPE_HOOK_MAX] = { + "BPRM_CHECK", + "MMAP", + "MPROTECT", + "KERNEL_READ", + "KERNEL_LOAD", +}; + +static const char *const audit_prop_names[__IPE_PROP_MAX] = { + "boot_verified=FALSE", + "boot_verified=TRUE", +}; + +/** + * audit_rule() - audit an IPE policy rule. + * @ab: Supplies a pointer to the audit_buffer to append to. + * @r: Supplies a pointer to the ipe_rule to approximate a string form for. + */ +static void audit_rule(struct audit_buffer *ab, const struct ipe_rule *r) +{ + const struct ipe_prop *ptr; + + audit_log_format(ab, " rule=\"op=%s ", audit_op_names[r->op]); + + list_for_each_entry(ptr, &r->props, next) + audit_log_format(ab, "%s ", audit_prop_names[ptr->type]); + + audit_log_format(ab, "action=%s\"", ACTSTR(r->action)); +} + +/** + * ipe_audit_match() - Audit a rule match in a policy evaluation. + * @ctx: Supplies a pointer to the evaluation context that was used in the + * evaluation. + * @match_type: Supplies the scope of the match: rule, operation default, + * global default. + * @act: Supplies the IPE's evaluation decision, deny or allow. + * @r: Supplies a pointer to the rule that was matched, if possible. 
+ */ +void ipe_audit_match(const struct ipe_eval_ctx *const ctx, + enum ipe_match match_type, + enum ipe_action_type act, const struct ipe_rule *const r) +{ + const char *op = audit_op_names[ctx->op]; + char comm[sizeof(current->comm)]; + struct audit_buffer *ab; + struct inode *inode; + + if (act != IPE_ACTION_DENY && !READ_ONCE(success_audit)) + return; + + ab = audit_log_start(audit_context(), GFP_ATOMIC | __GFP_NOWARN, + AUDIT_IPE_ACCESS); + if (!ab) + return; + + audit_log_format(ab, "ipe_op=%s ipe_hook=%s pid=%d comm=", + op, audit_hook_names[ctx->hook], + task_tgid_nr(current)); + audit_log_untrustedstring(ab, get_task_comm(comm, current)); + + if (ctx->file) { + audit_log_d_path(ab, " path=", &ctx->file->f_path); + inode = file_inode(ctx->file); + if (inode) { + audit_log_format(ab, " dev="); + audit_log_untrustedstring(ab, inode->i_sb->s_id); + audit_log_format(ab, " ino=%lu", inode->i_ino); + } else { + audit_log_format(ab, " dev=? ino=?"); + } + } else { + audit_log_format(ab, " path=? dev=? ino=?"); + } + + if (match_type == IPE_MATCH_RULE) + audit_rule(ab, r); + else if (match_type == IPE_MATCH_TABLE) + audit_log_format(ab, " rule=\"DEFAULT op=%s action=%s\"", op, + ACTSTR(act)); + else + audit_log_format(ab, " rule=\"DEFAULT action=%s\"", + ACTSTR(act)); + + audit_log_end(ab); +} + +/** + * audit_policy() - Audit a policy's name, version and thumbprint to @ab. + * @ab: Supplies a pointer to the audit buffer to append to. + * @audit_format: Supplies a pointer to the audit format string + * @p: Supplies a pointer to the policy to audit. + */ +static void audit_policy(struct audit_buffer *ab, + const char *audit_format, + const struct ipe_policy *const p) +{ + SHASH_DESC_ON_STACK(desc, tfm); + struct crypto_shash *tfm; + u8 *digest = NULL; + + tfm = crypto_alloc_shash(IPE_AUDIT_HASH_ALG, 0, 0); + if (IS_ERR(tfm)) + return; + + desc->tfm = tfm; + + digest = kzalloc(crypto_shash_digestsize(tfm), GFP_KERNEL); + if (!digest) + goto out; + + if (crypto_shash_init(desc)) + goto out; + + if (crypto_shash_update(desc, p->pkcs7, p->pkcs7len)) + goto out; + + if (crypto_shash_final(desc, digest)) + goto out; + + audit_log_format(ab, audit_format, p->parsed->name, + p->parsed->version.major, p->parsed->version.minor, + p->parsed->version.rev); + audit_log_n_hex(ab, digest, crypto_shash_digestsize(tfm)); + +out: + kfree(digest); + crypto_free_shash(tfm); +} + +/** + * ipe_audit_policy_activation() - Audit a policy being activated. + * @op: Supplies a pointer to the previously activated policy to audit. + * @np: Supplies a pointer to the newly activated policy to audit. + */ +void ipe_audit_policy_activation(const struct ipe_policy *const op, + const struct ipe_policy *const np) +{ + struct audit_buffer *ab; + + ab = audit_log_start(audit_context(), GFP_KERNEL, + AUDIT_IPE_CONFIG_CHANGE); + if (!ab) + return; + + if (op) { + audit_policy(ab, AUDIT_OLD_ACTIVE_POLICY_FMT, op); + audit_log_format(ab, " "); + } else { + /* + * old active policy can be NULL if there is no kernel + * built-in policy + */ + audit_log_format(ab, AUDIT_OLD_ACTIVE_POLICY_NULL_FMT); + audit_log_format(ab, " "); + } + audit_policy(ab, AUDIT_NEW_ACTIVE_POLICY_FMT, np); + audit_log_format(ab, " auid=%u ses=%u lsm=ipe res=1", + from_kuid(&init_user_ns, audit_get_loginuid(current)), + audit_get_sessionid(current)); + + audit_log_end(ab); +} + +/** + * ipe_audit_policy_load() - Audit a policy being loaded into the kernel. + * @p: Supplies a pointer to the policy to audit. 
+ */ +void ipe_audit_policy_load(const struct ipe_policy *const p) +{ + struct audit_buffer *ab; + + ab = audit_log_start(audit_context(), GFP_KERNEL, + AUDIT_IPE_POLICY_LOAD); + if (!ab) + return; + + audit_policy(ab, AUDIT_POLICY_LOAD_FMT, p); + audit_log_format(ab, " auid=%u ses=%u lsm=ipe res=1", + from_kuid(&init_user_ns, audit_get_loginuid(current)), + audit_get_sessionid(current)); + + audit_log_end(ab); +} diff --git a/security/ipe/audit.h b/security/ipe/audit.h new file mode 100644 index 000000000000..3ba8b8a91541 --- /dev/null +++ b/security/ipe/audit.h @@ -0,0 +1,18 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2020-2024 Microsoft Corporation. All rights reserved. + */ + +#ifndef _IPE_AUDIT_H +#define _IPE_AUDIT_H + +#include "policy.h" + +void ipe_audit_match(const struct ipe_eval_ctx *const ctx, + enum ipe_match match_type, + enum ipe_action_type act, const struct ipe_rule *const r); +void ipe_audit_policy_load(const struct ipe_policy *const p); +void ipe_audit_policy_activation(const struct ipe_policy *const op, + const struct ipe_policy *const np); + +#endif /* _IPE_AUDIT_H */ diff --git a/security/ipe/eval.c b/security/ipe/eval.c index d73d73dfed52..b99ed4bb09cf 100644 --- a/security/ipe/eval.c +++ b/security/ipe/eval.c @@ -9,12 +9,15 @@ #include #include #include +#include #include "ipe.h" #include "eval.h" #include "policy.h" +#include "audit.h" struct ipe_policy __rcu *ipe_active_policy; +bool success_audit; #define FILE_SUPERBLOCK(f) ((f)->f_path.mnt->mnt_sb) @@ -33,13 +36,16 @@ static void build_ipe_sb_ctx(struct ipe_eval_ctx *ctx, const struct file *const * @ctx: Supplies a pointer to the context to be populated. * @file: Supplies a pointer to the file to associated with the evaluation. * @op: Supplies the IPE policy operation associated with the evaluation. + * @hook: Supplies the LSM hook associated with the evaluation. 
*/ void ipe_build_eval_ctx(struct ipe_eval_ctx *ctx, const struct file *file, - enum ipe_op_type op) + enum ipe_op_type op, + enum ipe_hook_type hook) { ctx->file = file; ctx->op = op; + ctx->hook = hook; if (file) build_ipe_sb_ctx(ctx, file); @@ -100,6 +106,7 @@ int ipe_evaluate_event(const struct ipe_eval_ctx *const ctx) struct ipe_policy *pol = NULL; struct ipe_prop *prop = NULL; enum ipe_action_type action; + enum ipe_match match_type; bool match = false; rcu_read_lock(); @@ -111,14 +118,14 @@ int ipe_evaluate_event(const struct ipe_eval_ctx *const ctx) } if (ctx->op == IPE_OP_INVALID) { - if (pol->parsed->global_default_action == IPE_ACTION_DENY) { - rcu_read_unlock(); - return -EACCES; - } - if (pol->parsed->global_default_action == IPE_ACTION_INVALID) + if (pol->parsed->global_default_action == IPE_ACTION_INVALID) { WARN(1, "no default rule set for unknown op, ALLOW it"); - rcu_read_unlock(); - return 0; + action = IPE_ACTION_ALLOW; + } else { + action = pol->parsed->global_default_action; + } + match_type = IPE_MATCH_GLOBAL; + goto eval; } rules = &pol->parsed->rules[ctx->op]; @@ -136,16 +143,32 @@ int ipe_evaluate_event(const struct ipe_eval_ctx *const ctx) break; } - if (match) + if (match) { action = rule->action; - else if (rules->default_action != IPE_ACTION_INVALID) + match_type = IPE_MATCH_RULE; + } else if (rules->default_action != IPE_ACTION_INVALID) { action = rules->default_action; - else + match_type = IPE_MATCH_TABLE; + } else { action = pol->parsed->global_default_action; + match_type = IPE_MATCH_GLOBAL; + } +eval: + ipe_audit_match(ctx, match_type, action, rule); rcu_read_unlock(); + if (action == IPE_ACTION_DENY) return -EACCES; return 0; } + +/* Set the right module name */ +#ifdef KBUILD_MODNAME +#undef KBUILD_MODNAME +#define KBUILD_MODNAME "ipe" +#endif + +module_param(success_audit, bool, 0400); +MODULE_PARM_DESC(success_audit, "Start IPE with success auditing enabled"); diff --git a/security/ipe/eval.h b/security/ipe/eval.h index 0fa6492354dd..42b74a7a7c2b 100644 --- a/security/ipe/eval.h +++ b/security/ipe/eval.h @@ -10,10 +10,12 @@ #include #include "policy.h" +#include "hooks.h" #define IPE_EVAL_CTX_INIT ((struct ipe_eval_ctx){ 0 }) extern struct ipe_policy __rcu *ipe_active_policy; +extern bool success_audit; struct ipe_superblock { bool initramfs; @@ -21,14 +23,23 @@ struct ipe_superblock { struct ipe_eval_ctx { enum ipe_op_type op; + enum ipe_hook_type hook; const struct file *file; bool initramfs; }; +enum ipe_match { + IPE_MATCH_RULE = 0, + IPE_MATCH_TABLE, + IPE_MATCH_GLOBAL, + __IPE_MATCH_MAX +}; + void ipe_build_eval_ctx(struct ipe_eval_ctx *ctx, const struct file *file, - enum ipe_op_type op); + enum ipe_op_type op, + enum ipe_hook_type hook); int ipe_evaluate_event(const struct ipe_eval_ctx *const ctx); #endif /* _IPE_EVAL_H */ diff --git a/security/ipe/fs.c b/security/ipe/fs.c index 49484c8feead..9e410982b759 100644 --- a/security/ipe/fs.c +++ b/security/ipe/fs.c @@ -8,11 +8,62 @@ #include "ipe.h" #include "fs.h" +#include "eval.h" #include "policy.h" +#include "audit.h" static struct dentry *np __ro_after_init; static struct dentry *root __ro_after_init; struct dentry *policy_root __ro_after_init; +static struct dentry *audit_node __ro_after_init; + +/** + * setaudit() - Write handler for the securityfs node, "ipe/success_audit" + * @f: Supplies a file structure representing the securityfs node. + * @data: Supplies a buffer passed to the write syscall. + * @len: Supplies the length of @data. + * @offset: unused. 
+ * + * Return: + * * Length of buffer written - Success + * * %-EPERM - Insufficient permission + */ +static ssize_t setaudit(struct file *f, const char __user *data, + size_t len, loff_t *offset) +{ + int rc = 0; + bool value; + + if (!file_ns_capable(f, &init_user_ns, CAP_MAC_ADMIN)) + return -EPERM; + + rc = kstrtobool_from_user(data, len, &value); + if (rc) + return rc; + + WRITE_ONCE(success_audit, value); + + return len; +} + +/** + * getaudit() - Read handler for the securityfs node, "ipe/success_audit" + * @f: Supplies a file structure representing the securityfs node. + * @data: Supplies a buffer passed to the read syscall. + * @len: Supplies the length of @data. + * @offset: unused. + * + * Return: Length of buffer written + */ +static ssize_t getaudit(struct file *f, char __user *data, + size_t len, loff_t *offset) +{ + const char *result; + + result = ((READ_ONCE(success_audit)) ? "1" : "0"); + + return simple_read_from_buffer(data, len, offset, result, 1); +} /** * new_policy() - Write handler for the securityfs node, "ipe/new_policy". @@ -51,6 +102,10 @@ static ssize_t new_policy(struct file *f, const char __user *data, } rc = ipe_new_policyfs_node(p); + if (rc) + goto out; + + ipe_audit_policy_load(p); out: if (rc < 0) @@ -63,6 +118,11 @@ static const struct file_operations np_fops = { .write = new_policy, }; +static const struct file_operations audit_fops = { + .write = setaudit, + .read = getaudit, +}; + /** * ipe_init_securityfs() - Initialize IPE's securityfs tree at fsinit. * @@ -82,6 +142,13 @@ static int __init ipe_init_securityfs(void) goto err; } + audit_node = securityfs_create_file("success_audit", 0600, root, + NULL, &audit_fops); + if (IS_ERR(audit_node)) { + rc = PTR_ERR(audit_node); + goto err; + } + policy_root = securityfs_create_dir("policies", root); if (IS_ERR(policy_root)) { rc = PTR_ERR(policy_root); @@ -98,6 +165,7 @@ static int __init ipe_init_securityfs(void) err: securityfs_remove(np); securityfs_remove(policy_root); + securityfs_remove(audit_node); securityfs_remove(root); return rc; } diff --git a/security/ipe/hooks.c b/security/ipe/hooks.c index 0bd351e2b32a..e92228723784 100644 --- a/security/ipe/hooks.c +++ b/security/ipe/hooks.c @@ -29,7 +29,7 @@ int ipe_bprm_check_security(struct linux_binprm *bprm) { struct ipe_eval_ctx ctx = IPE_EVAL_CTX_INIT; - ipe_build_eval_ctx(&ctx, bprm->file, IPE_OP_EXEC); + ipe_build_eval_ctx(&ctx, bprm->file, IPE_OP_EXEC, IPE_HOOK_BPRM_CHECK); return ipe_evaluate_event(&ctx); } @@ -54,7 +54,7 @@ int ipe_mmap_file(struct file *f, unsigned long reqprot __always_unused, struct ipe_eval_ctx ctx = IPE_EVAL_CTX_INIT; if (prot & PROT_EXEC) { - ipe_build_eval_ctx(&ctx, f, IPE_OP_EXEC); + ipe_build_eval_ctx(&ctx, f, IPE_OP_EXEC, IPE_HOOK_MMAP); return ipe_evaluate_event(&ctx); } @@ -86,7 +86,7 @@ int ipe_file_mprotect(struct vm_area_struct *vma, return 0; if (prot & PROT_EXEC) { - ipe_build_eval_ctx(&ctx, vma->vm_file, IPE_OP_EXEC); + ipe_build_eval_ctx(&ctx, vma->vm_file, IPE_OP_EXEC, IPE_HOOK_MPROTECT); return ipe_evaluate_event(&ctx); } @@ -135,7 +135,7 @@ int ipe_kernel_read_file(struct file *file, enum kernel_read_file_id id, WARN(1, "no rule setup for kernel_read_file enum %d", id); } - ipe_build_eval_ctx(&ctx, file, op); + ipe_build_eval_ctx(&ctx, file, op, IPE_HOOK_KERNEL_READ); return ipe_evaluate_event(&ctx); } @@ -180,7 +180,7 @@ int ipe_kernel_load_data(enum kernel_load_data_id id, bool contents) WARN(1, "no rule setup for kernel_load_data enum %d", id); } - ipe_build_eval_ctx(&ctx, NULL, op); + 
ipe_build_eval_ctx(&ctx, NULL, op, IPE_HOOK_KERNEL_LOAD); return ipe_evaluate_event(&ctx); } diff --git a/security/ipe/hooks.h b/security/ipe/hooks.h index 4de5fabebd54..f4f0b544ddcc 100644 --- a/security/ipe/hooks.h +++ b/security/ipe/hooks.h @@ -9,6 +9,17 @@ #include #include +enum ipe_hook_type { + IPE_HOOK_BPRM_CHECK = 0, + IPE_HOOK_MMAP, + IPE_HOOK_MPROTECT, + IPE_HOOK_KERNEL_READ, + IPE_HOOK_KERNEL_LOAD, + __IPE_HOOK_MAX +}; + +#define IPE_HOOK_INVALID __IPE_HOOK_MAX + int ipe_bprm_check_security(struct linux_binprm *bprm); int ipe_mmap_file(struct file *f, unsigned long reqprot, unsigned long prot, diff --git a/security/ipe/policy.c b/security/ipe/policy.c index be9808b27e49..d8e7db857a2e 100644 --- a/security/ipe/policy.c +++ b/security/ipe/policy.c @@ -11,6 +11,7 @@ #include "fs.h" #include "policy.h" #include "policy_parser.h" +#include "audit.h" /* lock for synchronizing writers across ipe policy */ DEFINE_MUTEX(ipe_policy_lock); @@ -112,6 +113,7 @@ int ipe_update_policy(struct inode *root, const char *text, size_t textlen, root->i_private = new; swap(new->policyfs, old->policyfs); + ipe_audit_policy_load(new); mutex_lock(&ipe_policy_lock); ap = rcu_dereference_protected(ipe_active_policy, @@ -119,6 +121,7 @@ int ipe_update_policy(struct inode *root, const char *text, size_t textlen, if (old == ap) { rcu_assign_pointer(ipe_active_policy, new); mutex_unlock(&ipe_policy_lock); + ipe_audit_policy_activation(old, new); } else { mutex_unlock(&ipe_policy_lock); } @@ -217,6 +220,7 @@ int ipe_set_active_pol(const struct ipe_policy *p) } rcu_assign_pointer(ipe_active_policy, p); + ipe_audit_policy_activation(ap, p); mutex_unlock(&ipe_policy_lock); return 0; -- cgit v1.2.3 From 5151fa35ae5979821d091b80096b4c790b187bac Mon Sep 17 00:00:00 2001 From: Juha-Pekka Heikkila Date: Fri, 16 Aug 2024 14:52:28 +0300 Subject: drm/fourcc: define Intel Xe2 related tile4 ccs modifiers Add Tile4 type CCS modifiers to indicate the presence of compression on Xe2. I915_FORMAT_MOD_4_TILED_LNL_CCS is defined here for integrated graphics, with iGPU-related limitations. I915_FORMAT_MOD_4_TILED_BMG_CCS is also defined here for discrete graphics, with dGPU-related limitations. Signed-off-by: Juha-Pekka Heikkila Reviewed-by: Lucas De Marchi Acked-by: Maarten Lankhorst Link: https://patchwork.freedesktop.org/patch/msgid/20240816115229.531671-3-juhapekka.heikkila@gmail.com Signed-off-by: Rodrigo Vivi --- include/uapi/drm/drm_fourcc.h | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/drm/drm_fourcc.h b/include/uapi/drm/drm_fourcc.h index 2d84a8052b15..78abd819fd62 100644 --- a/include/uapi/drm/drm_fourcc.h +++ b/include/uapi/drm/drm_fourcc.h @@ -702,6 +702,31 @@ extern "C" { */ #define I915_FORMAT_MOD_4_TILED_MTL_RC_CCS_CC fourcc_mod_code(INTEL, 15) +/* + * Intel Color Control Surfaces (CCS) for graphics ver. 20 unified compression + * on integrated graphics + * + * The main surface is Tile 4 and at plane index 0. For semi-planar formats + * like NV12, the Y and UV planes are Tile 4 and are located at plane indices + * 0 and 1, respectively. The CCS for all planes are stored outside of the + * GEM object in a reserved memory area dedicated for the storage of the + * CCS data for all compressible GEM objects. + */ +#define I915_FORMAT_MOD_4_TILED_LNL_CCS fourcc_mod_code(INTEL, 16) + +/* + * Intel Color Control Surfaces (CCS) for graphics ver.
20 unified compression + * on discrete graphics + * + * The main surface is Tile 4 and at plane index 0. For semi-planar formats + * like NV12, the Y and UV planes are Tile 4 and are located at plane indices + * 0 and 1, respectively. The CCS for all planes are stored outside of the + * GEM object in a reserved memory area dedicated for the storage of the + * CCS data for all compressible GEM objects. The GEM object must be stored in + * contiguous memory with a size aligned to 64KB + */ +#define I915_FORMAT_MOD_4_TILED_BMG_CCS fourcc_mod_code(INTEL, 17) + /* * Tiled, NV12MT, grouped in 64 (pixels) x 32 (lines) -sized macroblocks * -- cgit v1.2.3 From 273f8c142003dfd874d45b1a60965809e95ccd50 Mon Sep 17 00:00:00 2001 From: Justin Iurman Date: Sat, 17 Aug 2024 15:18:18 +0200 Subject: net: ipv6: ioam6: new feature tunsrc This patch provides a new feature (i.e., "tunsrc") for the tunnel (i.e., "encap") mode of ioam6, just like seg6 already does, except that here it is attached to a route. The "tunsrc" is optional: when not provided (by default), the automatic resolution is applied. Using "tunsrc" when possible has a benefit: performance. See the comparison: - before (= "encap" mode): https://ibb.co/bNCzvf7 - after (= "encap" mode with "tunsrc"): https://ibb.co/PT8L6yq Signed-off-by: Justin Iurman Signed-off-by: Paolo Abeni --- include/uapi/linux/ioam6_iptunnel.h | 6 ++++ net/ipv6/ioam6_iptunnel.c | 65 ++++++++++++++++++++++++++++++++----- 2 files changed, 63 insertions(+), 8 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/ioam6_iptunnel.h b/include/uapi/linux/ioam6_iptunnel.h index 38f6a8fdfd34..8aef21e4a8c1 100644 --- a/include/uapi/linux/ioam6_iptunnel.h +++ b/include/uapi/linux/ioam6_iptunnel.h @@ -50,6 +50,12 @@ enum { IOAM6_IPTUNNEL_FREQ_K, /* u32 */ IOAM6_IPTUNNEL_FREQ_N, /* u32 */ + /* Tunnel src address. + * For encap,auto modes. + * Optional (automatic if not provided).
+ */ + IOAM6_IPTUNNEL_SRC, /* struct in6_addr */ + __IOAM6_IPTUNNEL_MAX, }; diff --git a/net/ipv6/ioam6_iptunnel.c b/net/ipv6/ioam6_iptunnel.c index cd2522f04edf..e34e1ff24546 100644 --- a/net/ipv6/ioam6_iptunnel.c +++ b/net/ipv6/ioam6_iptunnel.c @@ -42,6 +42,8 @@ struct ioam6_lwt { struct ioam6_lwt_freq freq; atomic_t pkt_cnt; u8 mode; + bool has_tunsrc; + struct in6_addr tunsrc; struct in6_addr tundst; struct ioam6_lwt_encap tuninfo; }; @@ -72,6 +74,7 @@ static const struct nla_policy ioam6_iptunnel_policy[IOAM6_IPTUNNEL_MAX + 1] = { [IOAM6_IPTUNNEL_MODE] = NLA_POLICY_RANGE(NLA_U8, IOAM6_IPTUNNEL_MODE_MIN, IOAM6_IPTUNNEL_MODE_MAX), + [IOAM6_IPTUNNEL_SRC] = NLA_POLICY_EXACT_LEN(sizeof(struct in6_addr)), [IOAM6_IPTUNNEL_DST] = NLA_POLICY_EXACT_LEN(sizeof(struct in6_addr)), [IOAM6_IPTUNNEL_TRACE] = NLA_POLICY_EXACT_LEN( sizeof(struct ioam6_trace_hdr)), @@ -144,6 +147,11 @@ static int ioam6_build_state(struct net *net, struct nlattr *nla, else mode = nla_get_u8(tb[IOAM6_IPTUNNEL_MODE]); + if (tb[IOAM6_IPTUNNEL_SRC] && mode == IOAM6_IPTUNNEL_MODE_INLINE) { + NL_SET_ERR_MSG(extack, "no tunnel src expected with this mode"); + return -EINVAL; + } + if (!tb[IOAM6_IPTUNNEL_DST] && mode != IOAM6_IPTUNNEL_MODE_INLINE) { NL_SET_ERR_MSG(extack, "this mode needs a tunnel destination"); return -EINVAL; @@ -168,16 +176,29 @@ static int ioam6_build_state(struct net *net, struct nlattr *nla, ilwt = ioam6_lwt_state(lwt); err = dst_cache_init(&ilwt->cache, GFP_ATOMIC); - if (err) { - kfree(lwt); - return err; - } + if (err) + goto free_lwt; atomic_set(&ilwt->pkt_cnt, 0); ilwt->freq.k = freq_k; ilwt->freq.n = freq_n; ilwt->mode = mode; + + if (!tb[IOAM6_IPTUNNEL_SRC]) { + ilwt->has_tunsrc = false; + } else { + ilwt->has_tunsrc = true; + ilwt->tunsrc = nla_get_in6_addr(tb[IOAM6_IPTUNNEL_SRC]); + + if (ipv6_addr_any(&ilwt->tunsrc)) { + NL_SET_ERR_MSG_ATTR(extack, tb[IOAM6_IPTUNNEL_SRC], + "invalid tunnel source address"); + err = -EINVAL; + goto free_cache; + } + } + if (tb[IOAM6_IPTUNNEL_DST]) ilwt->tundst = nla_get_in6_addr(tb[IOAM6_IPTUNNEL_DST]); @@ -202,6 +223,11 @@ static int ioam6_build_state(struct net *net, struct nlattr *nla, *ts = lwt; return 0; +free_cache: + dst_cache_destroy(&ilwt->cache); +free_lwt: + kfree(lwt); + return err; } static int ioam6_do_fill(struct net *net, struct sk_buff *skb) @@ -257,6 +283,8 @@ static int ioam6_do_inline(struct net *net, struct sk_buff *skb, static int ioam6_do_encap(struct net *net, struct sk_buff *skb, struct ioam6_lwt_encap *tuninfo, + bool has_tunsrc, + struct in6_addr *tunsrc, struct in6_addr *tundst) { struct dst_entry *dst = skb_dst(skb); @@ -286,8 +314,12 @@ static int ioam6_do_encap(struct net *net, struct sk_buff *skb, hdr->nexthdr = NEXTHDR_HOP; hdr->payload_len = cpu_to_be16(skb->len - sizeof(*hdr)); hdr->daddr = *tundst; - ipv6_dev_get_saddr(net, dst->dev, &hdr->daddr, - IPV6_PREFER_SRC_PUBLIC, &hdr->saddr); + + if (has_tunsrc) + memcpy(&hdr->saddr, tunsrc, sizeof(*tunsrc)); + else + ipv6_dev_get_saddr(net, dst->dev, &hdr->daddr, + IPV6_PREFER_SRC_PUBLIC, &hdr->saddr); skb_postpush_rcsum(skb, hdr, len); @@ -329,7 +361,9 @@ do_inline: case IOAM6_IPTUNNEL_MODE_ENCAP: do_encap: /* Encapsulation (ip6ip6) */ - err = ioam6_do_encap(net, skb, &ilwt->tuninfo, &ilwt->tundst); + err = ioam6_do_encap(net, skb, &ilwt->tuninfo, + ilwt->has_tunsrc, &ilwt->tunsrc, + &ilwt->tundst); if (unlikely(err)) goto drop; @@ -415,6 +449,13 @@ static int ioam6_fill_encap_info(struct sk_buff *skb, goto ret; if (ilwt->mode != IOAM6_IPTUNNEL_MODE_INLINE) { + if (ilwt->has_tunsrc) 
{ + err = nla_put_in6_addr(skb, IOAM6_IPTUNNEL_SRC, + &ilwt->tunsrc); + if (err) + goto ret; + } + err = nla_put_in6_addr(skb, IOAM6_IPTUNNEL_DST, &ilwt->tundst); if (err) goto ret; @@ -436,8 +477,12 @@ static int ioam6_encap_nlsize(struct lwtunnel_state *lwtstate) nla_total_size(sizeof(ilwt->mode)) + nla_total_size(sizeof(ilwt->tuninfo.traceh)); - if (ilwt->mode != IOAM6_IPTUNNEL_MODE_INLINE) + if (ilwt->mode != IOAM6_IPTUNNEL_MODE_INLINE) { + if (ilwt->has_tunsrc) + nlsize += nla_total_size(sizeof(ilwt->tunsrc)); + nlsize += nla_total_size(sizeof(ilwt->tundst)); + } return nlsize; } @@ -452,8 +497,12 @@ static int ioam6_encap_cmp(struct lwtunnel_state *a, struct lwtunnel_state *b) return (ilwt_a->freq.k != ilwt_b->freq.k || ilwt_a->freq.n != ilwt_b->freq.n || ilwt_a->mode != ilwt_b->mode || + ilwt_a->has_tunsrc != ilwt_b->has_tunsrc || (ilwt_a->mode != IOAM6_IPTUNNEL_MODE_INLINE && !ipv6_addr_equal(&ilwt_a->tundst, &ilwt_b->tundst)) || + (ilwt_a->mode != IOAM6_IPTUNNEL_MODE_INLINE && + ilwt_a->has_tunsrc && + !ipv6_addr_equal(&ilwt_a->tunsrc, &ilwt_b->tunsrc)) || trace_a->namespace_id != trace_b->namespace_id); } -- cgit v1.2.3 From a139c98f760efa1b6f0ee2f36ea6f62f04c8b20a Mon Sep 17 00:00:00 2001 From: Chris Wulff Date: Sat, 17 Aug 2024 10:28:51 -0400 Subject: USB: gadget: f_hid: Add GET_REPORT via userspace IOCTL While GET_REPORT is a mandatory request per the HID specification, the current implementation of the GET_REPORT request responds to the USB Host with an empty reply of the request length. However, some USB Hosts will request the contents of feature reports via the GET_REPORT request. In addition, some proprietary HID 'protocols' will expect different data, for the same report ID, to become available in the feature report by sending a preceding SET_REPORT to the USB Device that defines what data is to be presented when that feature report is subsequently retrieved via GET_REPORT (with a very fast < 5ms turn around between the SET_REPORT and the GET_REPORT). There are two other patch sets already submitted for adding GET_REPORT support. The first [1] allows for pre-priming a list of reports via IOCTLs which then allows the USB Host to perform the request, with no further userspace interaction possible during the GET_REPORT request. Another [2] allows for a single report to be set up by userspace via IOCTL, which will be fetched and returned by the kernel for subsequent GET_REPORT requests by the USB Host, also with no further userspace interaction possible. This patch, while loosely based on both the patch sets, differs by allowing the option for userspace to respond to each GET_REPORT request by setting up a poll to notify userspace that a new GET_REPORT request has arrived. To support this, two extra IOCTLs are supplied. The first is used to retrieve the report ID of the GET_REPORT request (in the case of having non-zero report IDs in the HID descriptor). The second allows for storing report responses in a list used to answer requests; a response is either added if it does not exist or updated if it exists already. A flag (userspace_req) selects whether subsequent requests notify userspace or not. Basic operation when a GET_REPORT request arrives from USB Host: - If the report ID exists in the list and it is set for immediate return (i.e.
userspace_req == false), then the response is sent immediately and userspace is not notified - If the report ID does not exist, or exists but is set to notify userspace (i.e. userspace_req == true), then userspace is notified via poll: - If userspace responds, the response is added to the list (or updated there) and its contents are sent to the host - If userspace does not respond within the fixed timeout (2500ms) but the report has been set previously, then the 'old' report contents are sent - If userspace does not respond within the fixed timeout (2500ms) and the report does not exist in the list, then an empty report is sent Note that userspace could 'prime' the report list at any other time. A sketch of such a responder follows the links below. While this patch allows for flexibility in how the system responds to requests, and therefore in the HID 'protocols' that could be supported, a drawback is the time it takes to service the requests and therefore the maximum throughput that would be achievable. The USB HID Specification v1.11 itself states that GET_REPORT is not intended for periodic data polling, so this limitation is not severe. Testing on an iMX8M Nano Ultra Lite with a heavy multi-core CPU loading showed that userspace can typically respond to the GET_REPORT request within 1200ms - which is well within the 5000ms most operating systems seem to allow, and within the 2500ms set by this patch. [1] https://lore.kernel.org/all/20220805070507.123151-2-sunil@amarulasolutions.com/ [2] https://lore.kernel.org/all/20220726005824.2817646-1-vi@endrift.com/
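A minimal sketch of such a userspace responder, for illustration only (the /dev/hidg0 node name and the 2-byte payload are assumptions; the ioctls and the POLLPRI notification come from this patch):

#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/usb/g_hid.h>

int main(void)
{
	struct usb_hidg_report rep;
	struct pollfd pfd = { .events = POLLPRI };
	__u8 id;

	pfd.fd = open("/dev/hidg0", O_RDWR);
	if (pfd.fd < 0)
		return 1;

	for (;;) {
		/* POLLPRI signals a pending GET_REPORT request */
		if (poll(&pfd, 1, -1) < 0)
			return 1;
		if (!(pfd.revents & POLLPRI))
			continue;

		/* Which report ID is the host asking for? */
		if (ioctl(pfd.fd, GADGET_HID_READ_GET_REPORT_ID, &id) < 0)
			return 1;

		memset(&rep, 0, sizeof(rep));
		rep.report_id = id;
		rep.userspace_req = 1;	/* keep waking us up next time */
		rep.length = 2;
		rep.data[0] = id;	/* illustrative payload */
		rep.data[1] = 0x42;
		if (ioctl(pfd.fd, GADGET_HID_WRITE_GET_REPORT, &rep) < 0)
			return 1;
	}
}

Signed-off-by: David Sands Signed-off-by: Chris Wulff Link: https://lore.kernel.org/r/20240817142850.1311460-2-crwulff@gmail.com Signed-off-by: Greg Kroah-Hartman --- drivers/usb/gadget/function/f_hid.c | 273 +++++++++++++++++++++++++++++++++++- include/uapi/linux/usb/g_hid.h | 40 ++++++ include/uapi/linux/usb/gadgetfs.h | 2 +- 3 files changed, 307 insertions(+), 8 deletions(-) create mode 100644 include/uapi/linux/usb/g_hid.h (limited to 'include/uapi') diff --git a/drivers/usb/gadget/function/f_hid.c b/drivers/usb/gadget/function/f_hid.c index 93dae017ae45..82c7dff6de5c 100644 --- a/drivers/usb/gadget/function/f_hid.c +++ b/drivers/usb/gadget/function/f_hid.c @@ -15,13 +15,21 @@ #include #include #include +#include #include +#include #include "u_f.h" #include "u_hid.h" #define HIDG_MINORS 4 +/* + * Most operating systems seem to allow for 5000ms timeout, we will allow + * userspace half that time to respond before we return an empty report.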
+ */ +#define GET_REPORT_TIMEOUT_MS 2500 + static int major, minors; static const struct class hidg_class = { @@ -31,6 +39,11 @@ static const struct class hidg_class = { static DEFINE_IDA(hidg_ida); static DEFINE_MUTEX(hidg_ida_lock); /* protects access to hidg_ida */ +struct report_entry { + struct usb_hidg_report report_data; + struct list_head node; +}; + /*-------------------------------------------------------------------------*/ /* HID gadget struct */ @@ -75,6 +88,19 @@ struct f_hidg { wait_queue_head_t write_queue; struct usb_request *req; + /* get report */ + struct usb_request *get_req; + struct usb_hidg_report get_report; + bool get_report_returned; + int get_report_req_report_id; + int get_report_req_report_length; + spinlock_t get_report_spinlock; + wait_queue_head_t get_queue; /* Waiting for userspace response */ + wait_queue_head_t get_id_queue; /* Get ID came in */ + struct work_struct work; + struct workqueue_struct *workqueue; + struct list_head report_list; + struct device dev; struct cdev cdev; struct usb_function func; @@ -524,6 +550,174 @@ release_write_pending: return status; } +static struct report_entry *f_hidg_search_for_report(struct f_hidg *hidg, u8 report_id) +{ + struct list_head *ptr; + struct report_entry *entry; + + list_for_each(ptr, &hidg->report_list) { + entry = list_entry(ptr, struct report_entry, node); + if (entry->report_data.report_id == report_id) + return entry; + } + + return NULL; +} + +static void get_report_workqueue_handler(struct work_struct *work) +{ + struct f_hidg *hidg = container_of(work, struct f_hidg, work); + struct usb_composite_dev *cdev = hidg->func.config->cdev; + struct usb_request *req; + struct report_entry *ptr; + unsigned long flags; + + int status = 0; + + spin_lock_irqsave(&hidg->get_report_spinlock, flags); + req = hidg->get_req; + if (!req) { + spin_unlock_irqrestore(&hidg->get_report_spinlock, flags); + return; + } + + req->zero = 0; + req->length = min_t(unsigned int, min_t(unsigned int, hidg->get_report_req_report_length, + hidg->report_length), + MAX_REPORT_LENGTH); + + /* Check if there is a response available for immediate response */ + ptr = f_hidg_search_for_report(hidg, hidg->get_report_req_report_id); + if (ptr && !ptr->report_data.userspace_req) { + /* Report exists in list and it is to be used for immediate response */ + req->buf = ptr->report_data.data; + status = usb_ep_queue(cdev->gadget->ep0, req, GFP_ATOMIC); + hidg->get_report_returned = true; + spin_unlock_irqrestore(&hidg->get_report_spinlock, flags); + } else { + /* + * Report does not exist in list or should not be immediately sent + * i.e. 
give userspace time to respond + */ + hidg->get_report_returned = false; + spin_unlock_irqrestore(&hidg->get_report_spinlock, flags); + wake_up(&hidg->get_id_queue); +#define GET_REPORT_COND (!hidg->get_report_returned) + /* Wait until userspace has responded or timeout */ + status = wait_event_interruptible_timeout(hidg->get_queue, !GET_REPORT_COND, + msecs_to_jiffies(GET_REPORT_TIMEOUT_MS)); + spin_lock_irqsave(&hidg->get_report_spinlock, flags); + req = hidg->get_req; + if (!req) { + spin_unlock_irqrestore(&hidg->get_report_spinlock, flags); + return; + } + if (status == 0 && !hidg->get_report_returned) { + /* GET_REPORT request was not serviced by userspace within timeout period */ + VDBG(cdev, "get_report : userspace timeout.\n"); + hidg->get_report_returned = true; + } + + /* Search again for report ID in list and respond to GET_REPORT request */ + ptr = f_hidg_search_for_report(hidg, hidg->get_report_req_report_id); + if (ptr) { + /* + * Either get an updated response just serviced by userspace + * or send the latest response in the list + */ + req->buf = ptr->report_data.data; + } else { + /* If there are no prevoiusly sent reports send empty report */ + req->buf = hidg->get_report.data; + memset(req->buf, 0x0, req->length); + } + + status = usb_ep_queue(cdev->gadget->ep0, req, GFP_ATOMIC); + spin_unlock_irqrestore(&hidg->get_report_spinlock, flags); + } + + if (status < 0) + VDBG(cdev, "usb_ep_queue error on ep0 responding to GET_REPORT\n"); +} + +static int f_hidg_get_report_id(struct file *file, __u8 __user *buffer) +{ + struct f_hidg *hidg = file->private_data; + int ret = 0; + + ret = put_user(hidg->get_report_req_report_id, buffer); + + return ret; +} + +static int f_hidg_get_report(struct file *file, struct usb_hidg_report __user *buffer) +{ + struct f_hidg *hidg = file->private_data; + struct usb_composite_dev *cdev = hidg->func.config->cdev; + unsigned long flags; + struct report_entry *entry; + struct report_entry *ptr; + __u8 report_id; + + entry = kmalloc(sizeof(*entry), GFP_KERNEL); + if (!entry) + return -ENOMEM; + + if (copy_from_user(&entry->report_data, buffer, + sizeof(struct usb_hidg_report))) { + ERROR(cdev, "copy_from_user error\n"); + kfree(entry); + return -EINVAL; + } + + report_id = entry->report_data.report_id; + + spin_lock_irqsave(&hidg->get_report_spinlock, flags); + ptr = f_hidg_search_for_report(hidg, report_id); + + if (ptr) { + /* Report already exists in list - update it */ + if (copy_from_user(&ptr->report_data, buffer, + sizeof(struct usb_hidg_report))) { + spin_unlock_irqrestore(&hidg->get_report_spinlock, flags); + ERROR(cdev, "copy_from_user error\n"); + kfree(entry); + return -EINVAL; + } + kfree(entry); + } else { + /* Report does not exist in list - add it */ + list_add_tail(&entry->node, &hidg->report_list); + } + + /* If there is no response pending then do nothing further */ + if (hidg->get_report_returned) { + spin_unlock_irqrestore(&hidg->get_report_spinlock, flags); + return 0; + } + + /* If this userspace response serves the current pending report */ + if (hidg->get_report_req_report_id == report_id) { + hidg->get_report_returned = true; + wake_up(&hidg->get_queue); + } + + spin_unlock_irqrestore(&hidg->get_report_spinlock, flags); + return 0; +} + +static long f_hidg_ioctl(struct file *file, unsigned int code, unsigned long arg) +{ + switch (code) { + case GADGET_HID_READ_GET_REPORT_ID: + return f_hidg_get_report_id(file, (__u8 __user *)arg); + case GADGET_HID_WRITE_GET_REPORT: + return f_hidg_get_report(file, (struct 
usb_hidg_report __user *)arg); + default: + return -ENOTTY; + } +} + static __poll_t f_hidg_poll(struct file *file, poll_table *wait) { struct f_hidg *hidg = file->private_data; @@ -531,6 +725,8 @@ static __poll_t f_hidg_poll(struct file *file, poll_table *wait) poll_wait(file, &hidg->read_queue, wait); poll_wait(file, &hidg->write_queue, wait); + poll_wait(file, &hidg->get_queue, wait); + poll_wait(file, &hidg->get_id_queue, wait); if (WRITE_COND) ret |= EPOLLOUT | EPOLLWRNORM; @@ -543,12 +739,16 @@ static __poll_t f_hidg_poll(struct file *file, poll_table *wait) ret |= EPOLLIN | EPOLLRDNORM; } + if (GET_REPORT_COND) + ret |= EPOLLPRI; + return ret; } #undef WRITE_COND #undef READ_COND_SSREPORT #undef READ_COND_INTOUT +#undef GET_REPORT_COND static int f_hidg_release(struct inode *inode, struct file *fd) { @@ -641,6 +841,10 @@ static void hidg_ssreport_complete(struct usb_ep *ep, struct usb_request *req) wake_up(&hidg->read_queue); } +static void hidg_get_report_complete(struct usb_ep *ep, struct usb_request *req) +{ +} + static int hidg_setup(struct usb_function *f, const struct usb_ctrlrequest *ctrl) { @@ -649,6 +853,7 @@ static int hidg_setup(struct usb_function *f, struct usb_request *req = cdev->req; int status = 0; __u16 value, length; + unsigned long flags; value = __le16_to_cpu(ctrl->wValue); length = __le16_to_cpu(ctrl->wLength); @@ -660,14 +865,20 @@ static int hidg_setup(struct usb_function *f, switch ((ctrl->bRequestType << 8) | ctrl->bRequest) { case ((USB_DIR_IN | USB_TYPE_CLASS | USB_RECIP_INTERFACE) << 8 | HID_REQ_GET_REPORT): - VDBG(cdev, "get_report\n"); + VDBG(cdev, "get_report | wLength=%d\n", ctrl->wLength); - /* send an empty report */ - length = min_t(unsigned, length, hidg->report_length); - memset(req->buf, 0x0, length); + /* + * Update GET_REPORT ID so that an ioctl can be used to determine what + * GET_REPORT the request was actually for. 
+ */ + spin_lock_irqsave(&hidg->get_report_spinlock, flags); + hidg->get_report_req_report_id = value & 0xff; + hidg->get_report_req_report_length = length; + spin_unlock_irqrestore(&hidg->get_report_spinlock, flags); - goto respond; - break; + queue_work(hidg->workqueue, &hidg->work); + + return status; case ((USB_DIR_IN | USB_TYPE_CLASS | USB_RECIP_INTERFACE) << 8 | HID_REQ_GET_PROTOCOL): @@ -793,6 +1004,14 @@ static void hidg_disable(struct usb_function *f) spin_unlock_irqrestore(&hidg->read_spinlock, flags); } + spin_lock_irqsave(&hidg->get_report_spinlock, flags); + if (!hidg->get_report_returned) { + usb_ep_free_request(f->config->cdev->gadget->ep0, hidg->get_req); + hidg->get_req = NULL; + hidg->get_report_returned = true; + } + spin_unlock_irqrestore(&hidg->get_report_spinlock, flags); + spin_lock_irqsave(&hidg->write_spinlock, flags); if (!hidg->write_pending) { free_ep_req(hidg->in_ep, hidg->req); @@ -902,6 +1121,14 @@ fail: return status; } +#ifdef CONFIG_COMPAT +static long f_hidg_compat_ioctl(struct file *file, unsigned int code, + unsigned long value) +{ + return f_hidg_ioctl(file, code, value); +} +#endif + static const struct file_operations f_hidg_fops = { .owner = THIS_MODULE, .open = f_hidg_open, @@ -909,6 +1136,10 @@ static const struct file_operations f_hidg_fops = { .write = f_hidg_write, .read = f_hidg_read, .poll = f_hidg_poll, + .unlocked_ioctl = f_hidg_ioctl, +#ifdef CONFIG_COMPAT + .compat_ioctl = f_hidg_compat_ioctl, +#endif .llseek = noop_llseek, }; @@ -919,6 +1150,15 @@ static int hidg_bind(struct usb_configuration *c, struct usb_function *f) struct usb_string *us; int status; + hidg->get_req = usb_ep_alloc_request(c->cdev->gadget->ep0, GFP_ATOMIC); + if (!hidg->get_req) + return -ENOMEM; + + hidg->get_req->zero = 0; + hidg->get_req->complete = hidg_get_report_complete; + hidg->get_req->context = hidg; + hidg->get_report_returned = true; + /* maybe allocate device-global string IDs, and patch descriptors */ us = usb_gstrings_attach(c->cdev, ct_func_strings, ARRAY_SIZE(ct_func_string_defs)); @@ -1004,9 +1244,24 @@ static int hidg_bind(struct usb_configuration *c, struct usb_function *f) hidg->write_pending = 1; hidg->req = NULL; spin_lock_init(&hidg->read_spinlock); + spin_lock_init(&hidg->get_report_spinlock); init_waitqueue_head(&hidg->write_queue); init_waitqueue_head(&hidg->read_queue); + init_waitqueue_head(&hidg->get_queue); + init_waitqueue_head(&hidg->get_id_queue); INIT_LIST_HEAD(&hidg->completed_out_req); + INIT_LIST_HEAD(&hidg->report_list); + + INIT_WORK(&hidg->work, get_report_workqueue_handler); + hidg->workqueue = alloc_workqueue("report_work", + WQ_FREEZABLE | + WQ_MEM_RECLAIM, + 1); + + if (!hidg->workqueue) { + status = -ENOMEM; + goto fail; + } /* create char device */ cdev_init(&hidg->cdev, &f_hidg_fops); @@ -1016,12 +1271,16 @@ static int hidg_bind(struct usb_configuration *c, struct usb_function *f) return 0; fail_free_descs: + destroy_workqueue(hidg->workqueue); usb_free_all_descriptors(f); fail: ERROR(f->config->cdev, "hidg_bind FAILED\n"); if (hidg->req != NULL) free_ep_req(hidg->in_ep, hidg->req); + usb_ep_free_request(c->cdev->gadget->ep0, hidg->get_req); + hidg->get_req = NULL; + return status; } @@ -1256,7 +1515,7 @@ static void hidg_unbind(struct usb_configuration *c, struct usb_function *f) struct f_hidg *hidg = func_to_hidg(f); cdev_device_del(&hidg->cdev, &hidg->dev); - + destroy_workqueue(hidg->workqueue); usb_free_all_descriptors(f); } diff --git a/include/uapi/linux/usb/g_hid.h b/include/uapi/linux/usb/g_hid.h new file mode 
100644 index 000000000000..b965092db476 --- /dev/null +++ b/include/uapi/linux/usb/g_hid.h @@ -0,0 +1,40 @@ +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */ + +#ifndef __UAPI_LINUX_USB_G_HID_H +#define __UAPI_LINUX_USB_G_HID_H + +#include + +/* Maximum HID report length for High-Speed USB (i.e. USB 2.0) */ +#define MAX_REPORT_LENGTH 64 + +/** + * struct usb_hidg_report - response to GET_REPORT + * @report_id: report ID that this is a response for + * @userspace_req: + * !0 this report is used for any pending GET_REPORT request + * but wait on userspace to issue a new report on future requests + * 0 this report is to be used for any future GET_REPORT requests + * @length: length of the report response + * @data: report response + * @padding: padding for 32/64 bit compatibility + * + * Structure used by GADGET_HID_WRITE_GET_REPORT ioctl on /dev/hidg*. + */ +struct usb_hidg_report { + __u8 report_id; + __u8 userspace_req; + __u16 length; + __u8 data[MAX_REPORT_LENGTH]; + __u8 padding[4]; +}; + +/* The 'g' code is used by gadgetfs and hid gadget ioctl requests. + * Don't add any colliding codes to either driver, and keep + * them in unique ranges. + */ + +#define GADGET_HID_READ_GET_REPORT_ID _IOR('g', 0x41, __u8) +#define GADGET_HID_WRITE_GET_REPORT _IOW('g', 0x42, struct usb_hidg_report) + +#endif /* __UAPI_LINUX_USB_G_HID_H */ diff --git a/include/uapi/linux/usb/gadgetfs.h b/include/uapi/linux/usb/gadgetfs.h index 835473910a49..9754822b2a40 100644 --- a/include/uapi/linux/usb/gadgetfs.h +++ b/include/uapi/linux/usb/gadgetfs.h @@ -62,7 +62,7 @@ struct usb_gadgetfs_event { }; -/* The 'g' code is also used by printer gadget ioctl requests. +/* The 'g' code is also used by printer and hid gadget ioctl requests. * Don't add any colliding codes to either driver, and keep * them in unique ranges (size 0x20 for now). */ -- cgit v1.2.3 From 5ac86f0ed04bce41242167ffa12ad92038788a95 Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Wed, 10 Jul 2024 16:15:55 -0700 Subject: virt: vbox: struct vmmdev_hgcm_pagelist: Replace 1-element array with flexible array Replace the deprecated[1] use of a 1-element array in struct vmmdev_hgcm_pagelist with a modern flexible array. As this is UAPI, we cannot trivially change the size of the struct, so use a union to retain the old first element's size, but switch "pages" to a flexible array. No binary differences are present after this conversion. Link: https://github.com/KSPP/linux/issues/79 [1] Reviewed-by: Gustavo A. R. Silva Link: https://lore.kernel.org/r/20240710231555.work.406-kees@kernel.org Signed-off-by: Kees Cook --- include/uapi/linux/vbox_vmmdev_types.h | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/vbox_vmmdev_types.h b/include/uapi/linux/vbox_vmmdev_types.h index f8a8d6b3c521..6073858d52a2 100644 --- a/include/uapi/linux/vbox_vmmdev_types.h +++ b/include/uapi/linux/vbox_vmmdev_types.h @@ -282,7 +282,10 @@ struct vmmdev_hgcm_pagelist { __u32 flags; /** VMMDEV_HGCM_F_PARM_*. */ __u16 offset_first_page; /** Data offset in the first page. */ __u16 page_count; /** Number of pages. */ - __u64 pages[1]; /** Page addresses. */ + union { + __u64 unused; /** Deprecated place-holder for first "pages" entry. */ + __DECLARE_FLEX_ARRAY(__u64, pages); /** Page addresses. 
*/ + }; }; VMMDEV_ASSERT_SIZE(vmmdev_hgcm_pagelist, 4 + 2 + 2 + 8); -- cgit v1.2.3 From 3849687869092094003ba009dc00e2e0237a3b8a Mon Sep 17 00:00:00 2001 From: Maxime Chevallier Date: Wed, 21 Aug 2024 17:09:55 +0200 Subject: net: phy: Introduce ethernet link topology representation Link topologies containing multiple network PHYs attached to the same net_device can be found when using a PHY as a media converter for use with an SFP connector, on which an SFP transceiver containing a PHY can be used. With the current model, the transceiver's PHY can't be used for operations such as cable testing, timestamping, macsec offload, etc. The reason is that most of the logic for these configurations, coming from either ethtool netlink or ioctls, tends to use netdev->phydev, which in multi-PHY systems will reference the PHY closest to the MAC. Introduce a numbering scheme that enumerates the PHY devices belonging to any netdev, which can in turn allow userspace to make more precise decisions with regard to each PHY's configuration. The numbering is maintained per-netdev, in a phy_device_list. The numbering works similarly to a netdevice's ifindex, with identifiers that are only recycled once INT_MAX has been reached. This prevents races that could occur between PHY listing and SFP transceiver removal/insertion. The identifiers are assigned at phy_attach time, as the numbering depends on the netdevice the phy is attached to. The PHY index can be re-used for PHYs that are persistent.
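As a rough sketch of how in-kernel code could resolve a PHY from that index (foo_report_phy() is a hypothetical caller; phy_link_topo_get_phy() and the phyindex field come from this patch, and RTNL is assumed to keep the returned phydev stable):

#include <linux/phy.h>
#include <linux/phy_link_topology.h>
#include <linux/rtnetlink.h>

static int foo_report_phy(struct net_device *dev, u32 phyindex)
{
	struct phy_device *phydev;
	int ret = -ENODEV;

	rtnl_lock();	/* keep the topology stable during the lookup */
	phydev = phy_link_topo_get_phy(dev, phyindex);
	if (phydev) {
		phydev_info(phydev, "found phyindex %u\n", phydev->phyindex);
		ret = 0;
	}
	rtnl_unlock();

	return ret;
}

Signed-off-by: Maxime Chevallier Reviewed-by: Christophe Leroy Tested-by: Christophe Leroy Signed-off-by: David S. Miller --- MAINTAINERS | 1 + drivers/net/phy/Makefile | 2 +- drivers/net/phy/phy_device.c | 6 +++ drivers/net/phy/phy_link_topology.c | 105 ++++++++++++++++++++++++++++++++++++ include/linux/netdevice.h | 4 +- include/linux/phy.h | 4 ++ include/linux/phy_link_topology.h | 82 ++++++++++++++++++++++++++++ include/uapi/linux/ethtool.h | 16 ++++++ net/core/dev.c | 15 ++++++ 9 files changed, 233 insertions(+), 2 deletions(-) create mode 100644 drivers/net/phy/phy_link_topology.c create mode 100644 include/linux/phy_link_topology.h (limited to 'include/uapi') diff --git a/MAINTAINERS b/MAINTAINERS index 5f45b75eba96..e4fa9010fcb6 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -8341,6 +8341,7 @@ F: include/linux/mii.h F: include/linux/of_net.h F: include/linux/phy.h F: include/linux/phy_fixed.h +F: include/linux/phy_link_topology.h F: include/linux/phylib_stubs.h F: include/linux/platform_data/mdio-bcm-unimac.h F: include/linux/platform_data/mdio-gpio.h diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile index f086a606a87e..33adb3df5f64 100644 --- a/drivers/net/phy/Makefile +++ b/drivers/net/phy/Makefile @@ -2,7 +2,7 @@ # Makefile for Linux PHY drivers libphy-y := phy.o phy-c45.o phy-core.o phy_device.o \ - linkmode.o + linkmode.o phy_link_topology.o mdio-bus-y += mdio_bus.o mdio_device.o ifdef CONFIG_MDIO_DEVICE diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c index 159e71c1c972..d896c799f7c2 100644 --- a/drivers/net/phy/phy_device.c +++ b/drivers/net/phy/phy_device.c @@ -29,6 +29,7 @@ #include #include #include +#include #include #include #include @@ -1526,6 +1527,10 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev, if (phydev->sfp_bus_attached) dev->sfp_bus = phydev->sfp_bus; + + err = phy_link_topo_add_phy(dev, phydev, PHY_UPSTREAM_MAC, dev); + if (err) + goto error; + } /* Some Ethernet drivers try to connect to a PHY device before @@ -1953,6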
+1958,7 @@ void phy_detach(struct phy_device *phydev) if (dev) { phydev->attached_dev->phydev = NULL; phydev->attached_dev = NULL; + phy_link_topo_del_phy(dev, phydev); } phydev->phylink = NULL; diff --git a/drivers/net/phy/phy_link_topology.c b/drivers/net/phy/phy_link_topology.c new file mode 100644 index 000000000000..4a5d73002a1a --- /dev/null +++ b/drivers/net/phy/phy_link_topology.c @@ -0,0 +1,105 @@ +// SPDX-License-Identifier: GPL-2.0+ +/* + * Infrastructure to handle all PHY devices connected to a given netdev, + * either directly or indirectly attached. + * + * Copyright (c) 2023 Maxime Chevallier + */ + +#include +#include +#include +#include + +static int netdev_alloc_phy_link_topology(struct net_device *dev) +{ + struct phy_link_topology *topo; + + topo = kzalloc(sizeof(*topo), GFP_KERNEL); + if (!topo) + return -ENOMEM; + + xa_init_flags(&topo->phys, XA_FLAGS_ALLOC1); + topo->next_phy_index = 1; + + dev->link_topo = topo; + + return 0; +} + +int phy_link_topo_add_phy(struct net_device *dev, + struct phy_device *phy, + enum phy_upstream upt, void *upstream) +{ + struct phy_link_topology *topo = dev->link_topo; + struct phy_device_node *pdn; + int ret; + + if (!topo) { + ret = netdev_alloc_phy_link_topology(dev); + if (ret) + return ret; + + topo = dev->link_topo; + } + + pdn = kzalloc(sizeof(*pdn), GFP_KERNEL); + if (!pdn) + return -ENOMEM; + + pdn->phy = phy; + switch (upt) { + case PHY_UPSTREAM_MAC: + pdn->upstream.netdev = (struct net_device *)upstream; + if (phy_on_sfp(phy)) + pdn->parent_sfp_bus = pdn->upstream.netdev->sfp_bus; + break; + case PHY_UPSTREAM_PHY: + pdn->upstream.phydev = (struct phy_device *)upstream; + if (phy_on_sfp(phy)) + pdn->parent_sfp_bus = pdn->upstream.phydev->sfp_bus; + break; + default: + ret = -EINVAL; + goto err; + } + pdn->upstream_type = upt; + + /* Attempt to re-use a previously allocated phy_index */ + if (phy->phyindex) + ret = xa_insert(&topo->phys, phy->phyindex, pdn, GFP_KERNEL); + else + ret = xa_alloc_cyclic(&topo->phys, &phy->phyindex, pdn, + xa_limit_32b, &topo->next_phy_index, + GFP_KERNEL); + + if (ret) + goto err; + + return 0; + +err: + kfree(pdn); + return ret; +} +EXPORT_SYMBOL_GPL(phy_link_topo_add_phy); + +void phy_link_topo_del_phy(struct net_device *dev, + struct phy_device *phy) +{ + struct phy_link_topology *topo = dev->link_topo; + struct phy_device_node *pdn; + + if (!topo) + return; + + pdn = xa_erase(&topo->phys, phy->phyindex); + + /* We delete the PHY from the topology, however we don't re-set the + * phy->phyindex field. If the PHY isn't gone, we can re-assign it the + * same index next time it's added back to the topology + */ + + kfree(pdn); +} +EXPORT_SYMBOL_GPL(phy_link_topo_del_phy); diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 614ec5d3d75b..289a6133769c 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -40,7 +40,6 @@ #include #endif #include - #include #include #include @@ -81,6 +80,7 @@ struct xdp_frame; struct xdp_metadata_ops; struct xdp_md; struct ethtool_netdev_state; +struct phy_link_topology; typedef u32 xdp_features_t; @@ -1951,6 +1951,7 @@ enum netdev_reg_state { * @fcoe_ddp_xid: Max exchange id for FCoE LRO by ddp * * @priomap: XXX: need comments on this one + * @link_topo: Physical link topology tracking attached PHYs * @phydev: Physical device may attach itself * for hardware timestamping * @sfp_bus: attached &struct sfp_bus structure. 
@@ -2342,6 +2343,7 @@ struct net_device { #if IS_ENABLED(CONFIG_CGROUP_NET_PRIO) struct netprio_map __rcu *priomap; #endif + struct phy_link_topology *link_topo; struct phy_device *phydev; struct sfp_bus *sfp_bus; struct lock_class_key *qdisc_tx_busylock; diff --git a/include/linux/phy.h b/include/linux/phy.h index 6b7d40d49129..3b0f4bf80650 100644 --- a/include/linux/phy.h +++ b/include/linux/phy.h @@ -554,6 +554,9 @@ struct macsec_ops; * @drv: Pointer to the driver for this PHY instance * @devlink: Create a link between phy dev and mac dev, if the external phy * used by current mac interface is managed by another mac interface. + * @phyindex: Unique id across the phy's parent tree of phys to address the PHY + * from userspace, similar to ifindex. A zero index means the PHY + * wasn't assigned an id yet. * @phy_id: UID for this device found during discovery * @c45_ids: 802.3-c45 Device Identifiers if is_c45. * @is_c45: Set to true if this PHY uses clause 45 addressing. @@ -656,6 +659,7 @@ struct phy_device { struct device_link *devlink; + u32 phyindex; u32 phy_id; struct phy_c45_device_ids c45_ids; diff --git a/include/linux/phy_link_topology.h b/include/linux/phy_link_topology.h new file mode 100644 index 000000000000..68a59e25821c --- /dev/null +++ b/include/linux/phy_link_topology.h @@ -0,0 +1,82 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * PHY device list allow maintaining a list of PHY devices that are + * part of a netdevice's link topology. PHYs can for example be chained, + * as is the case when using a PHY that exposes an SFP module, on which an + * SFP transceiver that embeds a PHY is connected. + * + * This list can then be used by userspace to leverage individual PHY + * capabilities. + */ +#ifndef __PHY_LINK_TOPOLOGY_H +#define __PHY_LINK_TOPOLOGY_H + +#include +#include + +struct xarray; +struct phy_device; +struct sfp_bus; + +struct phy_link_topology { + struct xarray phys; + u32 next_phy_index; +}; + +struct phy_device_node { + enum phy_upstream upstream_type; + + union { + struct net_device *netdev; + struct phy_device *phydev; + } upstream; + + struct sfp_bus *parent_sfp_bus; + + struct phy_device *phy; +}; + +#if IS_ENABLED(CONFIG_PHYLIB) +int phy_link_topo_add_phy(struct net_device *dev, + struct phy_device *phy, + enum phy_upstream upt, void *upstream); + +void phy_link_topo_del_phy(struct net_device *dev, struct phy_device *phy); + +static inline struct phy_device * +phy_link_topo_get_phy(struct net_device *dev, u32 phyindex) +{ + struct phy_link_topology *topo = dev->link_topo; + struct phy_device_node *pdn; + + if (!topo) + return NULL; + + pdn = xa_load(&topo->phys, phyindex); + if (pdn) + return pdn->phy; + + return NULL; +} + +#else +static inline int phy_link_topo_add_phy(struct net_device *dev, + struct phy_device *phy, + enum phy_upstream upt, void *upstream) +{ + return 0; +} + +static inline void phy_link_topo_del_phy(struct net_device *dev, + struct phy_device *phy) +{ +} + +static inline struct phy_device * +phy_link_topo_get_phy(struct net_device *dev, u32 phyindex) +{ + return NULL; +} +#endif + +#endif /* __PHY_LINK_TOPOLOGY_H */ diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h index 4a0a6e703483..c405ed63acfa 100644 --- a/include/uapi/linux/ethtool.h +++ b/include/uapi/linux/ethtool.h @@ -2533,4 +2533,20 @@ struct ethtool_link_settings { * __u32 map_lp_advertising[link_mode_masks_nwords]; */ }; + +/** + * enum phy_upstream - Represents the upstream component a given PHY device + * is connected to, as in what is on the 
other end of the MII bus. Most PHYs + * will be attached to an Ethernet MAC controller, but in some cases, there's + * an intermediate PHY used as a media-converter, which will driver another + * MII interface as its output. + * @PHY_UPSTREAM_MAC: Upstream component is a MAC (a switch port, + * or ethernet controller) + * @PHY_UPSTREAM_PHY: Upstream component is a PHY (likely a media converter) + */ +enum phy_upstream { + PHY_UPSTREAM_MAC, + PHY_UPSTREAM_PHY, +}; + #endif /* _UAPI_LINUX_ETHTOOL_H */ diff --git a/net/core/dev.c b/net/core/dev.c index e7260889d4cb..d0e7483abdcd 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -158,6 +158,7 @@ #include #include #include +#include #include "dev.h" #include "net-sysfs.h" @@ -10321,6 +10322,17 @@ static void netdev_do_free_pcpu_stats(struct net_device *dev) } } +static void netdev_free_phy_link_topology(struct net_device *dev) +{ + struct phy_link_topology *topo = dev->link_topo; + + if (IS_ENABLED(CONFIG_PHYLIB) && topo) { + xa_destroy(&topo->phys); + kfree(topo); + dev->link_topo = NULL; + } +} + /** * register_netdevice() - register a network device * @dev: device to register @@ -11099,6 +11111,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name, #ifdef CONFIG_NET_SCHED hash_init(dev->qdisc_hash); #endif + dev->priv_flags = IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM; setup(dev); @@ -11191,6 +11204,8 @@ void free_netdev(struct net_device *dev) free_percpu(dev->xdp_bulkq); dev->xdp_bulkq = NULL; + netdev_free_phy_link_topology(dev); + /* Compatibility with error handling in drivers */ if (dev->reg_state == NETREG_UNINITIALIZED || dev->reg_state == NETREG_DUMMY) { -- cgit v1.2.3 From c15e065b46dc4e19837275b826c1960d55564abd Mon Sep 17 00:00:00 2001 From: Maxime Chevallier Date: Wed, 21 Aug 2024 17:09:59 +0200 Subject: net: ethtool: Allow passing a phy index for some commands Some netlink commands are targeted at Ethernet PHYs, to control some of their features. As there are several such commands, add the ability to pass a PHY index in the ethnl request, which will populate the generic ethnl_req_info with the passed phy_index. Add a helper that netlink command handlers need to use to grab the targeted PHY from the req_info. This helper needs to be called while holding rtnl_lock(), as the PHY may be removed at any point.
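A hedged sketch of how a netlink command handler would consume this (foo_prepare_data() and ETHTOOL_A_FOO_HEADER are made-up names; only ethnl_req_get_phydev() and its NULL/ERR_PTR contract come from this patch):

static int foo_prepare_data(const struct ethnl_req_info *req_base,
			    struct ethnl_reply_data *reply_base,
			    const struct genl_info *info)
{
	struct phy_device *phydev;

	/* The ethnl core already holds rtnl_lock() around handlers,
	 * which keeps the returned phydev alive while we use it.
	 */
	phydev = ethnl_req_get_phydev(req_base,
				      info->attrs[ETHTOOL_A_FOO_HEADER],
				      info->extack);
	if (IS_ERR(phydev))
		return PTR_ERR(phydev);	/* invalid phy_index in request */
	if (!phydev)
		return -EOPNOTSUPP;	/* no PHY attached to this netdev */

	/* ... query or configure phydev here ... */
	return 0;
}

Signed-off-by: Maxime Chevallier Reviewed-by: Christophe Leroy Tested-by: Christophe Leroy Signed-off-by: David S. Miller --- Documentation/networking/ethtool-netlink.rst | 7 ++++ include/uapi/linux/ethtool_netlink.h | 1 + net/ethtool/netlink.c | 57 +++++++++++++++++++++++++++- net/ethtool/netlink.h | 28 ++++++++++++++ 4 files changed, 91 insertions(+), 2 deletions(-) (limited to 'include/uapi') diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst index 3f6c6880e7c4..2d14b8551348 100644 --- a/Documentation/networking/ethtool-netlink.rst +++ b/Documentation/networking/ethtool-netlink.rst @@ -57,6 +57,7 @@ Structure of this header is ``ETHTOOL_A_HEADER_DEV_INDEX`` u32 device ifindex ``ETHTOOL_A_HEADER_DEV_NAME`` string device name ``ETHTOOL_A_HEADER_FLAGS`` u32 flags common for all requests + ``ETHTOOL_A_HEADER_PHY_INDEX`` u32 phy device index ============================== ====== ============================= ``ETHTOOL_A_HEADER_DEV_INDEX`` and ``ETHTOOL_A_HEADER_DEV_NAME`` identify the @@ -81,6 +82,12 @@ the behaviour is backward compatible, i.e.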
requests from old clients not aware of the flag should be interpreted the way the client expects. A client must not set flags it does not understand. +``ETHTOOL_A_HEADER_PHY_INDEX`` identifies the Ethernet PHY the message relates to. +As there are numerous commands that are related to PHY configuration, and because +there may be more than one PHY on the link, the PHY index can be passed in the +request for the commands that need it. It is, however, not mandatory, and if it +is not passed for commands that target a PHY, the net_device.phydev pointer +is used. Bit sets ======== diff --git a/include/uapi/linux/ethtool_netlink.h b/include/uapi/linux/ethtool_netlink.h index 9074fa309bd6..49d1f9220fde 100644 --- a/include/uapi/linux/ethtool_netlink.h +++ b/include/uapi/linux/ethtool_netlink.h @@ -134,6 +134,7 @@ enum { ETHTOOL_A_HEADER_DEV_INDEX, /* u32 */ ETHTOOL_A_HEADER_DEV_NAME, /* string */ ETHTOOL_A_HEADER_FLAGS, /* u32 - ETHTOOL_FLAG_* */ + ETHTOOL_A_HEADER_PHY_INDEX, /* u32 */ /* add new constants above here */ __ETHTOOL_A_HEADER_CNT, diff --git a/net/ethtool/netlink.c b/net/ethtool/netlink.c index 041548e5f5e6..b00061924e80 100644 --- a/net/ethtool/netlink.c +++ b/net/ethtool/netlink.c @@ -2,6 +2,7 @@ #include #include +#include #include #include "netlink.h" #include "module_fw.h" @@ -31,6 +32,24 @@ const struct nla_policy ethnl_header_policy_stats[] = { ETHTOOL_FLAGS_STATS), }; +const struct nla_policy ethnl_header_policy_phy[] = { + [ETHTOOL_A_HEADER_DEV_INDEX] = { .type = NLA_U32 }, + [ETHTOOL_A_HEADER_DEV_NAME] = { .type = NLA_NUL_STRING, + .len = ALTIFNAMSIZ - 1 }, + [ETHTOOL_A_HEADER_FLAGS] = NLA_POLICY_MASK(NLA_U32, + ETHTOOL_FLAGS_BASIC), + [ETHTOOL_A_HEADER_PHY_INDEX] = NLA_POLICY_MIN(NLA_U32, 1), +}; + +const struct nla_policy ethnl_header_policy_phy_stats[] = { + [ETHTOOL_A_HEADER_DEV_INDEX] = { .type = NLA_U32 }, + [ETHTOOL_A_HEADER_DEV_NAME] = { .type = NLA_NUL_STRING, + .len = ALTIFNAMSIZ - 1 }, + [ETHTOOL_A_HEADER_FLAGS] = NLA_POLICY_MASK(NLA_U32, + ETHTOOL_FLAGS_STATS), + [ETHTOOL_A_HEADER_PHY_INDEX] = NLA_POLICY_MIN(NLA_U32, 1), +}; + int ethnl_sock_priv_set(struct sk_buff *skb, struct net_device *dev, u32 portid, enum ethnl_sock_type type) { @@ -119,7 +138,7 @@ int ethnl_parse_header_dev_get(struct ethnl_req_info *req_info, const struct nlattr *header, struct net *net, struct netlink_ext_ack *extack, bool require_dev) { - struct nlattr *tb[ARRAY_SIZE(ethnl_header_policy)]; + struct nlattr *tb[ARRAY_SIZE(ethnl_header_policy_phy)]; const struct nlattr *devname_attr; struct net_device *dev = NULL; u32 flags = 0; @@ -134,7 +153,7 @@ int ethnl_parse_header_dev_get(struct ethnl_req_info *req_info, /* No validation here, command policy should have a nested policy set * for the header, therefore validation should have already been done.
*/ - ret = nla_parse_nested(tb, ARRAY_SIZE(ethnl_header_policy) - 1, header, + ret = nla_parse_nested(tb, ARRAY_SIZE(ethnl_header_policy_phy) - 1, header, NULL, extack); if (ret < 0) return ret; @@ -175,11 +194,45 @@ int ethnl_parse_header_dev_get(struct ethnl_req_info *req_info, return -EINVAL; } + if (tb[ETHTOOL_A_HEADER_PHY_INDEX]) { + if (dev) { + req_info->phy_index = nla_get_u32(tb[ETHTOOL_A_HEADER_PHY_INDEX]); + } else { + NL_SET_ERR_MSG_ATTR(extack, header, + "phy_index set without a netdev"); + return -EINVAL; + } + } + req_info->dev = dev; req_info->flags = flags; return 0; } +struct phy_device *ethnl_req_get_phydev(const struct ethnl_req_info *req_info, + const struct nlattr *header, + struct netlink_ext_ack *extack) +{ + struct phy_device *phydev; + + ASSERT_RTNL(); + + if (!req_info->dev) + return NULL; + + if (!req_info->phy_index) + return req_info->dev->phydev; + + phydev = phy_link_topo_get_phy(req_info->dev, req_info->phy_index); + if (!phydev) { + NL_SET_ERR_MSG_ATTR(extack, header, + "no phy matching phyindex"); + return ERR_PTR(-ENODEV); + } + + return phydev; +} + /** * ethnl_fill_reply_header() - Put common header into a reply message * @skb: skb with the message diff --git a/net/ethtool/netlink.h b/net/ethtool/netlink.h index 236c189fc968..e326699405be 100644 --- a/net/ethtool/netlink.h +++ b/net/ethtool/netlink.h @@ -251,6 +251,9 @@ static inline unsigned int ethnl_reply_header_size(void) * @dev: network device the request is for (may be null) * @dev_tracker: refcount tracker for @dev reference * @flags: request flags common for all request types + * @phy_index: phy_device index connected to @dev this request is for. Can be + * 0 if the request doesn't target a phy, or if the @dev's attached + * phy is targeted. * * This is a common base for request specific structures holding data from * parsed userspace request. These always embed struct ethnl_req_info at @@ -260,6 +263,7 @@ struct ethnl_req_info { struct net_device *dev; netdevice_tracker dev_tracker; u32 flags; + u32 phy_index; }; static inline void ethnl_parse_header_dev_put(struct ethnl_req_info *req_info) @@ -267,6 +271,27 @@ static inline void ethnl_parse_header_dev_put(struct ethnl_req_info *req_info) { netdev_put(req_info->dev, &req_info->dev_tracker); } +/** + * ethnl_req_get_phydev() - Gets the phy_device targeted by this request, + * if any. Must be called under rtnl_lock(). + * @req_info: The ethnl request to get the phy from. + * @header: The netlink header, used for error reporting. + * @extack: The netlink extended ACK, for error reporting. + * + * The caller must hold RTNL, until it's done interacting with the returned + * phy_device. + * + * Return: A phy_device pointer corresponding to the passed phy_index, if one + * is provided. If not, the phy_device attached to the + * net_device targeted by this request is returned. If there's no + * targeted net_device, or no phy_device is attached, NULL is + * returned. If the provided phy_index is invalid, an error pointer + * is returned.
+ */ +struct phy_device *ethnl_req_get_phydev(const struct ethnl_req_info *req_info, + const struct nlattr *header, + struct netlink_ext_ack *extack); + /** * struct ethnl_reply_data - base type of reply data for GET requests * @dev: device for current reply message; in single shot requests it is @@ -409,9 +434,12 @@ extern const struct ethnl_request_ops ethnl_rss_request_ops; extern const struct ethnl_request_ops ethnl_plca_cfg_request_ops; extern const struct ethnl_request_ops ethnl_plca_status_request_ops; extern const struct ethnl_request_ops ethnl_mm_request_ops; +extern const struct ethnl_request_ops ethnl_phy_request_ops; extern const struct nla_policy ethnl_header_policy[ETHTOOL_A_HEADER_FLAGS + 1]; extern const struct nla_policy ethnl_header_policy_stats[ETHTOOL_A_HEADER_FLAGS + 1]; +extern const struct nla_policy ethnl_header_policy_phy[ETHTOOL_A_HEADER_PHY_INDEX + 1]; +extern const struct nla_policy ethnl_header_policy_phy_stats[ETHTOOL_A_HEADER_PHY_INDEX + 1]; extern const struct nla_policy ethnl_strset_get_policy[ETHTOOL_A_STRSET_COUNTS_ONLY + 1]; extern const struct nla_policy ethnl_linkinfo_get_policy[ETHTOOL_A_LINKINFO_HEADER + 1]; extern const struct nla_policy ethnl_linkinfo_set_policy[ETHTOOL_A_LINKINFO_TP_MDIX_CTRL + 1]; -- cgit v1.2.3 From 17194be4c8e1e82d8b484e58cdcb495c0714d1fd Mon Sep 17 00:00:00 2001 From: Maxime Chevallier Date: Wed, 21 Aug 2024 17:10:01 +0200 Subject: net: ethtool: Introduce a command to list PHYs on an interface As we have the ability to track the PHYs connected to a net_device through the link_topology, we can expose this list to userspace. This allows userspace to use these identifiers for phy-specific commands and decide which PHY to target based on the link topology. Add PHY_GET and PHY_DUMP, which can be a filtered DUMP operation to list devices on only one interface. Signed-off-by: Maxime Chevallier Reviewed-by: Christophe Leroy Tested-by: Christophe Leroy Signed-off-by: David S. Miller --- Documentation/networking/ethtool-netlink.rst | 41 ++++ include/uapi/linux/ethtool_netlink.h | 19 ++ net/ethtool/Makefile | 3 +- net/ethtool/netlink.c | 9 + net/ethtool/netlink.h | 5 + net/ethtool/phy.c | 308 +++++++++++++++++++++++++++ 6 files changed, 384 insertions(+), 1 deletion(-) create mode 100644 net/ethtool/phy.c (limited to 'include/uapi') diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst index 2d14b8551348..8c152871c23c 100644 --- a/Documentation/networking/ethtool-netlink.rst +++ b/Documentation/networking/ethtool-netlink.rst @@ -2191,6 +2191,46 @@ string. The ``ETHTOOL_A_MODULE_FW_FLASH_DONE`` and ``ETHTOOL_A_MODULE_FW_FLASH_TOTAL`` attributes encode the completed and total amount of work, respectively. +PHY_GET +======= + +Retrieve information about a given Ethernet PHY sitting on the link. The DO +operation returns all available information about dev->phydev. Users can also +specify a PHY_INDEX, in which case the DO request returns information about that +specific PHY. +As there can be more than one PHY, the DUMP operation can be used to list the PHYs +present on a given interface, by passing an interface index or name in +the dump request.
+ +Request contents: + + ==================================== ====== ========================== + ``ETHTOOL_A_PHY_HEADER`` nested request header + ==================================== ====== ========================== + +Kernel response contents: + + ===================================== ====== =============================== + ``ETHTOOL_A_PHY_HEADER`` nested request header + ``ETHTOOL_A_PHY_INDEX`` u32 the phy's unique index, that can + be used for phy-specific + requests + ``ETHTOOL_A_PHY_DRVNAME`` string the phy driver name + ``ETHTOOL_A_PHY_NAME`` string the phy device name + ``ETHTOOL_A_PHY_UPSTREAM_TYPE`` u32 the type of device this phy is + connected to + ``ETHTOOL_A_PHY_UPSTREAM_INDEX`` u32 the PHY index of the upstream + PHY + ``ETHTOOL_A_PHY_UPSTREAM_SFP_NAME`` string if this PHY is connected to + its parent PHY through an SFP + bus, the name of this sfp bus + ``ETHTOOL_A_PHY_DOWNSTREAM_SFP_NAME`` string if the phy controls an sfp bus, + the name of the sfp bus + ===================================== ====== =============================== + +When ``ETHTOOL_A_PHY_UPSTREAM_TYPE`` is PHY_UPSTREAM_PHY, the PHY's parent is +another PHY. + Request translation =================== @@ -2298,4 +2338,5 @@ are netlink only. n/a ``ETHTOOL_MSG_MM_GET`` n/a ``ETHTOOL_MSG_MM_SET`` n/a ``ETHTOOL_MSG_MODULE_FW_FLASH_ACT`` + n/a ``ETHTOOL_MSG_PHY_GET`` =================================== ===================================== diff --git a/include/uapi/linux/ethtool_netlink.h b/include/uapi/linux/ethtool_netlink.h index 49d1f9220fde..45d8bcdea056 100644 --- a/include/uapi/linux/ethtool_netlink.h +++ b/include/uapi/linux/ethtool_netlink.h @@ -58,6 +58,7 @@ enum { ETHTOOL_MSG_MM_GET, ETHTOOL_MSG_MM_SET, ETHTOOL_MSG_MODULE_FW_FLASH_ACT, + ETHTOOL_MSG_PHY_GET, /* add new constants above here */ __ETHTOOL_MSG_USER_CNT, @@ -111,6 +112,8 @@ enum { ETHTOOL_MSG_MM_GET_REPLY, ETHTOOL_MSG_MM_NTF, ETHTOOL_MSG_MODULE_FW_FLASH_NTF, + ETHTOOL_MSG_PHY_GET_REPLY, + ETHTOOL_MSG_PHY_NTF, /* add new constants above here */ __ETHTOOL_MSG_KERNEL_CNT, @@ -1055,6 +1058,22 @@ enum { ETHTOOL_A_MODULE_FW_FLASH_MAX = (__ETHTOOL_A_MODULE_FW_FLASH_CNT - 1) }; +enum { + ETHTOOL_A_PHY_UNSPEC, + ETHTOOL_A_PHY_HEADER, /* nest - _A_HEADER_* */ + ETHTOOL_A_PHY_INDEX, /* u32 */ + ETHTOOL_A_PHY_DRVNAME, /* string */ + ETHTOOL_A_PHY_NAME, /* string */ + ETHTOOL_A_PHY_UPSTREAM_TYPE, /* u32 */ + ETHTOOL_A_PHY_UPSTREAM_INDEX, /* u32 */ + ETHTOOL_A_PHY_UPSTREAM_SFP_NAME, /* string */ + ETHTOOL_A_PHY_DOWNSTREAM_SFP_NAME, /* string */ + + /* add new constants above here */ + __ETHTOOL_A_PHY_CNT, + ETHTOOL_A_PHY_MAX = (__ETHTOOL_A_PHY_CNT - 1) +}; + /* generic netlink info */ #define ETHTOOL_GENL_NAME "ethtool" #define ETHTOOL_GENL_VERSION 1 diff --git a/net/ethtool/Makefile b/net/ethtool/Makefile index 9a190635fe95..9b540644ba31 100644 --- a/net/ethtool/Makefile +++ b/net/ethtool/Makefile @@ -8,4 +8,5 @@ ethtool_nl-y := netlink.o bitset.o strset.o linkinfo.o linkmodes.o rss.o \ linkstate.o debug.o wol.o features.o privflags.o rings.o \ channels.o coalesce.o pause.o eee.o tsinfo.o cabletest.o \ tunnels.o fec.o eeprom.o stats.o phc_vclocks.o mm.o \ - module.o cmis_fw_update.o cmis_cdb.o pse-pd.o plca.o mm.o + module.o cmis_fw_update.o cmis_cdb.o pse-pd.o plca.o mm.o \ + phy.o diff --git a/net/ethtool/netlink.c b/net/ethtool/netlink.c index b00061924e80..e3f0ef6b851b 100644 --- a/net/ethtool/netlink.c +++ b/net/ethtool/netlink.c @@ -1234,6 +1234,15 @@ static const struct genl_ops ethtool_genl_ops[] = { .policy = ethnl_module_fw_flash_act_policy, 
.maxattr = ARRAY_SIZE(ethnl_module_fw_flash_act_policy) - 1, }, + { + .cmd = ETHTOOL_MSG_PHY_GET, + .doit = ethnl_phy_doit, + .start = ethnl_phy_start, + .dumpit = ethnl_phy_dumpit, + .done = ethnl_phy_done, + .policy = ethnl_phy_get_policy, + .maxattr = ARRAY_SIZE(ethnl_phy_get_policy) - 1, + }, }; static const struct genl_multicast_group ethtool_nl_mcgrps[] = { diff --git a/net/ethtool/netlink.h b/net/ethtool/netlink.h index e326699405be..203b08eb6c6f 100644 --- a/net/ethtool/netlink.h +++ b/net/ethtool/netlink.h @@ -484,6 +484,7 @@ extern const struct nla_policy ethnl_plca_get_status_policy[ETHTOOL_A_PLCA_HEADE extern const struct nla_policy ethnl_mm_get_policy[ETHTOOL_A_MM_HEADER + 1]; extern const struct nla_policy ethnl_mm_set_policy[ETHTOOL_A_MM_MAX + 1]; extern const struct nla_policy ethnl_module_fw_flash_act_policy[ETHTOOL_A_MODULE_FW_FLASH_PASSWORD + 1]; +extern const struct nla_policy ethnl_phy_get_policy[ETHTOOL_A_PHY_HEADER + 1]; int ethnl_set_features(struct sk_buff *skb, struct genl_info *info); int ethnl_act_cable_test(struct sk_buff *skb, struct genl_info *info); @@ -494,6 +495,10 @@ int ethnl_tunnel_info_dumpit(struct sk_buff *skb, struct netlink_callback *cb); int ethnl_act_module_fw_flash(struct sk_buff *skb, struct genl_info *info); int ethnl_rss_dump_start(struct netlink_callback *cb); int ethnl_rss_dumpit(struct sk_buff *skb, struct netlink_callback *cb); +int ethnl_phy_start(struct netlink_callback *cb); +int ethnl_phy_doit(struct sk_buff *skb, struct genl_info *info); +int ethnl_phy_dumpit(struct sk_buff *skb, struct netlink_callback *cb); +int ethnl_phy_done(struct netlink_callback *cb); extern const char stats_std_names[__ETHTOOL_STATS_CNT][ETH_GSTRING_LEN]; extern const char stats_eth_phy_names[__ETHTOOL_A_STATS_ETH_PHY_CNT][ETH_GSTRING_LEN]; diff --git a/net/ethtool/phy.c b/net/ethtool/phy.c new file mode 100644 index 000000000000..560dd039c662 --- /dev/null +++ b/net/ethtool/phy.c @@ -0,0 +1,308 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Bootlin + * + */ +#include "common.h" +#include "netlink.h" + +#include +#include +#include + +struct phy_req_info { + struct ethnl_req_info base; + struct phy_device_node *pdn; +}; + +#define PHY_REQINFO(__req_base) \ + container_of(__req_base, struct phy_req_info, base) + +const struct nla_policy ethnl_phy_get_policy[ETHTOOL_A_PHY_HEADER + 1] = { + [ETHTOOL_A_PHY_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy), +}; + +/* Caller holds rtnl */ +static ssize_t +ethnl_phy_reply_size(const struct ethnl_req_info *req_base, + struct netlink_ext_ack *extack) +{ + struct phy_req_info *req_info = PHY_REQINFO(req_base); + struct phy_device_node *pdn = req_info->pdn; + struct phy_device *phydev = pdn->phy; + size_t size = 0; + + ASSERT_RTNL(); + + /* ETHTOOL_A_PHY_INDEX */ + size += nla_total_size(sizeof(u32)); + + /* ETHTOOL_A_DRVNAME */ + if (phydev->drv) + size += nla_total_size(strlen(phydev->drv->name) + 1); + + /* ETHTOOL_A_NAME */ + size += nla_total_size(strlen(dev_name(&phydev->mdio.dev)) + 1); + + /* ETHTOOL_A_PHY_UPSTREAM_TYPE */ + size += nla_total_size(sizeof(u32)); + + if (phy_on_sfp(phydev)) { + const char *upstream_sfp_name = sfp_get_name(pdn->parent_sfp_bus); + + /* ETHTOOL_A_PHY_UPSTREAM_SFP_NAME */ + if (upstream_sfp_name) + size += nla_total_size(strlen(upstream_sfp_name) + 1); + + /* ETHTOOL_A_PHY_UPSTREAM_INDEX */ + size += nla_total_size(sizeof(u32)); + } + + /* ETHTOOL_A_PHY_DOWNSTREAM_SFP_NAME */ + if (phydev->sfp_bus) { + const char *sfp_name = sfp_get_name(phydev->sfp_bus); + + if 
(sfp_name) + size += nla_total_size(strlen(sfp_name) + 1); + } + + return size; +} + +static int +ethnl_phy_fill_reply(const struct ethnl_req_info *req_base, struct sk_buff *skb) +{ + struct phy_req_info *req_info = PHY_REQINFO(req_base); + struct phy_device_node *pdn = req_info->pdn; + struct phy_device *phydev = pdn->phy; + enum phy_upstream ptype; + + ptype = pdn->upstream_type; + + if (nla_put_u32(skb, ETHTOOL_A_PHY_INDEX, phydev->phyindex) || + nla_put_string(skb, ETHTOOL_A_PHY_NAME, dev_name(&phydev->mdio.dev)) || + nla_put_u32(skb, ETHTOOL_A_PHY_UPSTREAM_TYPE, ptype)) + return -EMSGSIZE; + + if (phydev->drv && + nla_put_string(skb, ETHTOOL_A_PHY_DRVNAME, phydev->drv->name)) + return -EMSGSIZE; + + if (ptype == PHY_UPSTREAM_PHY) { + struct phy_device *upstream = pdn->upstream.phydev; + const char *sfp_upstream_name; + + /* Parent index */ + if (nla_put_u32(skb, ETHTOOL_A_PHY_UPSTREAM_INDEX, upstream->phyindex)) + return -EMSGSIZE; + + if (pdn->parent_sfp_bus) { + sfp_upstream_name = sfp_get_name(pdn->parent_sfp_bus); + if (sfp_upstream_name && + nla_put_string(skb, ETHTOOL_A_PHY_UPSTREAM_SFP_NAME, + sfp_upstream_name)) + return -EMSGSIZE; + } + } + + if (phydev->sfp_bus) { + const char *sfp_name = sfp_get_name(phydev->sfp_bus); + + if (sfp_name && + nla_put_string(skb, ETHTOOL_A_PHY_DOWNSTREAM_SFP_NAME, + sfp_name)) + return -EMSGSIZE; + } + + return 0; +} + +static int ethnl_phy_parse_request(struct ethnl_req_info *req_base, + struct nlattr **tb, + struct netlink_ext_ack *extack) +{ + struct phy_link_topology *topo = req_base->dev->link_topo; + struct phy_req_info *req_info = PHY_REQINFO(req_base); + struct phy_device *phydev; + + phydev = ethnl_req_get_phydev(req_base, tb[ETHTOOL_A_PHY_HEADER], + extack); + if (!phydev) + return 0; + + if (IS_ERR(phydev)) + return PTR_ERR(phydev); + + if (!topo) + return 0; + + req_info->pdn = xa_load(&topo->phys, phydev->phyindex); + + return 0; +} + +int ethnl_phy_doit(struct sk_buff *skb, struct genl_info *info) +{ + struct phy_req_info req_info = {}; + struct nlattr **tb = info->attrs; + struct sk_buff *rskb; + void *reply_payload; + int reply_len; + int ret; + + ret = ethnl_parse_header_dev_get(&req_info.base, + tb[ETHTOOL_A_PHY_HEADER], + genl_info_net(info), info->extack, + true); + if (ret < 0) + return ret; + + rtnl_lock(); + + ret = ethnl_phy_parse_request(&req_info.base, tb, info->extack); + if (ret < 0) + goto err_unlock_rtnl; + + /* No PHY, return early */ + if (!req_info.pdn->phy) + goto err_unlock_rtnl; + + ret = ethnl_phy_reply_size(&req_info.base, info->extack); + if (ret < 0) + goto err_unlock_rtnl; + reply_len = ret + ethnl_reply_header_size(); + + rskb = ethnl_reply_init(reply_len, req_info.base.dev, + ETHTOOL_MSG_PHY_GET_REPLY, + ETHTOOL_A_PHY_HEADER, + info, &reply_payload); + if (!rskb) { + ret = -ENOMEM; + goto err_unlock_rtnl; + } + + ret = ethnl_phy_fill_reply(&req_info.base, rskb); + if (ret) + goto err_free_msg; + + rtnl_unlock(); + ethnl_parse_header_dev_put(&req_info.base); + genlmsg_end(rskb, reply_payload); + + return genlmsg_reply(rskb, info); + +err_free_msg: + nlmsg_free(rskb); +err_unlock_rtnl: + rtnl_unlock(); + ethnl_parse_header_dev_put(&req_info.base); + return ret; +} + +struct ethnl_phy_dump_ctx { + struct phy_req_info *phy_req_info; + unsigned long ifindex; + unsigned long phy_index; +}; + +int ethnl_phy_start(struct netlink_callback *cb) +{ + const struct genl_info *info = genl_info_dump(cb); + struct ethnl_phy_dump_ctx *ctx = (void *)cb->ctx; + int ret; + + BUILD_BUG_ON(sizeof(*ctx) > sizeof(cb->ctx)); + 
+ ctx->phy_req_info = kzalloc(sizeof(*ctx->phy_req_info), GFP_KERNEL); + if (!ctx->phy_req_info) + return -ENOMEM; + + ret = ethnl_parse_header_dev_get(&ctx->phy_req_info->base, + info->attrs[ETHTOOL_A_PHY_HEADER], + sock_net(cb->skb->sk), cb->extack, + false); + ctx->ifindex = 0; + ctx->phy_index = 0; + + if (ret) + kfree(ctx->phy_req_info); + + return ret; +} + +int ethnl_phy_done(struct netlink_callback *cb) +{ + struct ethnl_phy_dump_ctx *ctx = (void *)cb->ctx; + + if (ctx->phy_req_info->base.dev) + ethnl_parse_header_dev_put(&ctx->phy_req_info->base); + + kfree(ctx->phy_req_info); + + return 0; +} + +static int ethnl_phy_dump_one_dev(struct sk_buff *skb, struct net_device *dev, + struct netlink_callback *cb) +{ + struct ethnl_phy_dump_ctx *ctx = (void *)cb->ctx; + struct phy_req_info *pri = ctx->phy_req_info; + struct phy_device_node *pdn; + int ret = 0; + void *ehdr; + + pri->base.dev = dev; + + if (!dev->link_topo) + return 0; + + xa_for_each_start(&dev->link_topo->phys, ctx->phy_index, pdn, ctx->phy_index) { + ehdr = ethnl_dump_put(skb, cb, ETHTOOL_MSG_PHY_GET_REPLY); + if (!ehdr) { + ret = -EMSGSIZE; + break; + } + + ret = ethnl_fill_reply_header(skb, dev, ETHTOOL_A_PHY_HEADER); + if (ret < 0) { + genlmsg_cancel(skb, ehdr); + break; + } + + pri->pdn = pdn; + ret = ethnl_phy_fill_reply(&pri->base, skb); + if (ret < 0) { + genlmsg_cancel(skb, ehdr); + break; + } + + genlmsg_end(skb, ehdr); + } + + return ret; +} + +int ethnl_phy_dumpit(struct sk_buff *skb, struct netlink_callback *cb) +{ + struct ethnl_phy_dump_ctx *ctx = (void *)cb->ctx; + struct net *net = sock_net(skb->sk); + struct net_device *dev; + int ret = 0; + + rtnl_lock(); + + if (ctx->phy_req_info->base.dev) { + ret = ethnl_phy_dump_one_dev(skb, ctx->phy_req_info->base.dev, cb); + } else { + for_each_netdev_dump(net, dev, ctx->ifindex) { + ret = ethnl_phy_dump_one_dev(skb, dev, cb); + if (ret) + break; + + ctx->phy_index = 0; + } + } + rtnl_unlock(); + + return ret; +} -- cgit v1.2.3 From b0966c724584a5a9fd7fb529de19807c31f27a45 Mon Sep 17 00:00:00 2001 From: Dave Marchevsky Date: Tue, 13 Aug 2024 21:24:23 +0000 Subject: bpf: Support bpf_kptr_xchg into local kptr Currently, users can only stash kptr into map values with bpf_kptr_xchg(). This patch further supports stashing kptr into local kptr by adding local kptr as a valid destination type. When stashing into local kptr, btf_record in program BTF is used instead of btf_record in map to search for the btf_field of the local kptr. The local kptr specific checks in check_reg_type() only apply when the source argument of bpf_kptr_xchg() is local kptr. Therefore, we make the scope of the check explicit as the destination now can also be local kptr. Acked-by: Martin KaFai Lau Signed-off-by: Dave Marchevsky Signed-off-by: Amery Hung Link: https://lore.kernel.org/r/20240813212424.2871455-5-amery.hung@bytedance.com Signed-off-by: Alexei Starovoitov --- include/uapi/linux/bpf.h | 9 +++++---- kernel/bpf/helpers.c | 4 ++-- kernel/bpf/verifier.c | 44 ++++++++++++++++++++++++++++++-------------- 3 files changed, 37 insertions(+), 20 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 35bcf52dbc65..e2629457d72d 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -5519,11 +5519,12 @@ union bpf_attr { * **-EOPNOTSUPP** if the hash calculation failed or **-EINVAL** if * invalid arguments are passed. 
* - * void *bpf_kptr_xchg(void *map_value, void *ptr) + * void *bpf_kptr_xchg(void *dst, void *ptr) * Description - * Exchange kptr at pointer *map_value* with *ptr*, and return the - * old value. *ptr* can be NULL, otherwise it must be a referenced - * pointer which will be released when this helper is called. + * Exchange kptr at pointer *dst* with *ptr*, and return the old value. + * *dst* can be map value or local kptr. *ptr* can be NULL, otherwise + * it must be a referenced pointer which will be released when this helper + * is called. * Return * The old value of kptr (which can be NULL). The returned pointer * if not NULL, is a reference which must be released using its diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index bf55b81b3e06..f12f075b70c5 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -1619,9 +1619,9 @@ void bpf_wq_cancel_and_free(void *val) schedule_work(&work->delete_work); } -BPF_CALL_2(bpf_kptr_xchg, void *, map_value, void *, ptr) +BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr) { - unsigned long *kptr = map_value; + unsigned long *kptr = dst; /* This helper may be inlined by verifier. */ return xchg(kptr, (unsigned long)ptr); diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 5fe2a8978bb8..33270b363689 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -7803,29 +7803,38 @@ static int process_kptr_func(struct bpf_verifier_env *env, int regno, struct bpf_call_arg_meta *meta) { struct bpf_reg_state *regs = cur_regs(env), *reg = ®s[regno]; - struct bpf_map *map_ptr = reg->map_ptr; struct btf_field *kptr_field; + struct bpf_map *map_ptr; + struct btf_record *rec; u32 kptr_off; + if (type_is_ptr_alloc_obj(reg->type)) { + rec = reg_btf_record(reg); + } else { /* PTR_TO_MAP_VALUE */ + map_ptr = reg->map_ptr; + if (!map_ptr->btf) { + verbose(env, "map '%s' has to have BTF in order to use bpf_kptr_xchg\n", + map_ptr->name); + return -EINVAL; + } + rec = map_ptr->record; + meta->map_ptr = map_ptr; + } + if (!tnum_is_const(reg->var_off)) { verbose(env, "R%d doesn't have constant offset. 
kptr has to be at the constant offset\n", regno); return -EINVAL; } - if (!map_ptr->btf) { - verbose(env, "map '%s' has to have BTF in order to use bpf_kptr_xchg\n", - map_ptr->name); - return -EINVAL; - } - if (!btf_record_has_field(map_ptr->record, BPF_KPTR)) { - verbose(env, "map '%s' has no valid kptr\n", map_ptr->name); + + if (!btf_record_has_field(rec, BPF_KPTR)) { + verbose(env, "R%d has no valid kptr\n", regno); return -EINVAL; } - meta->map_ptr = map_ptr; kptr_off = reg->off + reg->var_off.value; - kptr_field = btf_record_find(map_ptr->record, kptr_off, BPF_KPTR); + kptr_field = btf_record_find(rec, kptr_off, BPF_KPTR); if (!kptr_field) { verbose(env, "off=%d doesn't point to kptr\n", kptr_off); return -EACCES; @@ -8412,7 +8421,12 @@ static const struct bpf_reg_types func_ptr_types = { .types = { PTR_TO_FUNC } }; static const struct bpf_reg_types stack_ptr_types = { .types = { PTR_TO_STACK } }; static const struct bpf_reg_types const_str_ptr_types = { .types = { PTR_TO_MAP_VALUE } }; static const struct bpf_reg_types timer_types = { .types = { PTR_TO_MAP_VALUE } }; -static const struct bpf_reg_types kptr_xchg_dest_types = { .types = { PTR_TO_MAP_VALUE } }; +static const struct bpf_reg_types kptr_xchg_dest_types = { + .types = { + PTR_TO_MAP_VALUE, + PTR_TO_BTF_ID | MEM_ALLOC + } +}; static const struct bpf_reg_types dynptr_types = { .types = { PTR_TO_STACK, @@ -8483,7 +8497,8 @@ static int check_reg_type(struct bpf_verifier_env *env, u32 regno, if (base_type(arg_type) == ARG_PTR_TO_MEM) type &= ~DYNPTR_TYPE_FLAG_MASK; - if (meta->func_id == BPF_FUNC_kptr_xchg && type_is_alloc(type)) { + /* Local kptr types are allowed as the source argument of bpf_kptr_xchg */ + if (meta->func_id == BPF_FUNC_kptr_xchg && type_is_alloc(type) && regno == BPF_REG_2) { type &= ~MEM_ALLOC; type &= ~MEM_PERCPU; } @@ -8576,7 +8591,8 @@ found: verbose(env, "verifier internal error: unimplemented handling of MEM_ALLOC\n"); return -EFAULT; } - if (meta->func_id == BPF_FUNC_kptr_xchg) { + /* Check if local kptr in src arg matches kptr in dst arg */ + if (meta->func_id == BPF_FUNC_kptr_xchg && regno == BPF_REG_2) { if (map_kptr_match_type(env, meta->kptr_field, reg, regno)) return -EACCES; } @@ -8887,7 +8903,7 @@ skip_type_check: meta->release_regno = regno; } - if (reg->ref_obj_id) { + if (reg->ref_obj_id && base_type(arg_type) != ARG_KPTR_XCHG_DEST) { if (meta->ref_obj_id) { verbose(env, "verifier internal error: more than one arg with ref_obj_id R%d %u %u\n", regno, reg->ref_obj_id, -- cgit v1.2.3 From 65ab5ac4df012388481d0414fcac1d5ac1721fb3 Mon Sep 17 00:00:00 2001 From: Jordan Rome Date: Fri, 23 Aug 2024 12:51:00 -0700 Subject: bpf: Add bpf_copy_from_user_str kfunc This adds a kfunc wrapper around strncpy_from_user, which can be called from sleepable BPF programs. This matches the non-sleepable 'bpf_probe_read_user_str' helper except it includes an additional 'flags' param, which allows consumers to clear the entire destination buffer on success or failure. 
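A minimal usage sketch from the BPF side could look as follows (not part of this patch; the __ksym declaration, the probed function and the buffer size are illustrative assumptions, and BPF_UPROBE needs a reasonably recent libbpf):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

int bpf_copy_from_user_str(void *dst, u32 dst__sz,
			   const void *unsafe_ptr__ign, u64 flags) __ksym;

char LICENSE[] SEC("license") = "GPL";

/* "uprobe.s" makes the program sleepable, which this kfunc requires. */
SEC("uprobe.s")
int BPF_UPROBE(trace_fn, const char *user_str)
{
	char buf[32];
	int ret;

	/* On success, ret is the copied length including the trailing NUL;
	 * BPF_F_PAD_ZEROS zero-fills the remainder of buf. */
	ret = bpf_copy_from_user_str(buf, sizeof(buf), user_str,
				     BPF_F_PAD_ZEROS);
	if (ret < 0)
		return 0;

	bpf_printk("copied %d bytes", ret);
	return 0;
}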
Signed-off-by: Jordan Rome Link: https://lore.kernel.org/r/20240823195101.3621028-1-linux@jordanrome.com Signed-off-by: Alexei Starovoitov --- include/uapi/linux/bpf.h | 9 +++++++++ kernel/bpf/helpers.c | 42 ++++++++++++++++++++++++++++++++++++++++++ tools/include/uapi/linux/bpf.h | 9 +++++++++ 3 files changed, 60 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index e2629457d72d..c3a5728db115 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -7513,4 +7513,13 @@ struct bpf_iter_num { __u64 __opaque[1]; } __attribute__((aligned(8))); +/* + * Flags to control BPF kfunc behaviour. + * - BPF_F_PAD_ZEROS: Pad destination buffer with zeros. (See the respective + * helper documentation for details.) + */ +enum bpf_kfunc_flags { + BPF_F_PAD_ZEROS = (1ULL << 0), +}; + #endif /* _UAPI__LINUX_BPF_H__ */ diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index f12f075b70c5..7e7761c537f8 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2962,6 +2962,47 @@ __bpf_kfunc void bpf_iter_bits_destroy(struct bpf_iter_bits *it) bpf_mem_free(&bpf_global_ma, kit->bits); } +/** + * bpf_copy_from_user_str() - Copy a string from an unsafe user address + * @dst: Destination address, in kernel space. This buffer must be + * at least @dst__sz bytes long. + * @dst__sz: Maximum number of bytes to copy, includes the trailing NUL. + * @unsafe_ptr__ign: Source address, in user space. + * @flags: The only supported flag is BPF_F_PAD_ZEROS + * + * Copies a NUL-terminated string from userspace to BPF space. If user string is + * too long this will still ensure zero termination in the dst buffer unless + * buffer size is 0. + * + * If BPF_F_PAD_ZEROS flag is set, memset the tail of @dst to 0 on success and + * memset all of @dst on failure. + */ +__bpf_kfunc int bpf_copy_from_user_str(void *dst, u32 dst__sz, const void __user *unsafe_ptr__ign, u64 flags) +{ + int ret; + + if (unlikely(flags & ~BPF_F_PAD_ZEROS)) + return -EINVAL; + + if (unlikely(!dst__sz)) + return 0; + + ret = strncpy_from_user(dst, unsafe_ptr__ign, dst__sz - 1); + if (ret < 0) { + if (flags & BPF_F_PAD_ZEROS) + memset((char *)dst, 0, dst__sz); + + return ret; + } + + if (flags & BPF_F_PAD_ZEROS) + memset((char *)dst + ret, 0, dst__sz - ret); + else + ((char *)dst)[ret] = '\0'; + + return ret + 1; +} + __bpf_kfunc_end_defs(); BTF_KFUNCS_START(generic_btf_ids) @@ -3047,6 +3088,7 @@ BTF_ID_FLAGS(func, bpf_preempt_enable) BTF_ID_FLAGS(func, bpf_iter_bits_new, KF_ITER_NEW) BTF_ID_FLAGS(func, bpf_iter_bits_next, KF_ITER_NEXT | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_iter_bits_destroy, KF_ITER_DESTROY) +BTF_ID_FLAGS(func, bpf_copy_from_user_str, KF_SLEEPABLE) BTF_KFUNCS_END(common_btf_ids) static const struct btf_kfunc_id_set common_kfunc_set = { diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 35bcf52dbc65..f329ee44627a 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -7512,4 +7512,13 @@ struct bpf_iter_num { __u64 __opaque[1]; } __attribute__((aligned(8))); +/* + * Flags to control BPF kfunc behaviour. + * - BPF_F_PAD_ZEROS: Pad destination buffer with zeros. (See the respective + * helper documentation for details.) 
+ */ +enum bpf_kfunc_flags { + BPF_F_PAD_ZEROS = (1ULL << 0), +}; + #endif /* _UAPI__LINUX_BPF_H__ */ -- cgit v1.2.3 From d29cb3726f03cdac7889f0109a7cb84f79e168a8 Mon Sep 17 00:00:00 2001 From: Pavel Begunkov Date: Wed, 7 Aug 2024 15:18:13 +0100 Subject: io_uring: add absolute mode wait timeouts In addition to current relative timeouts for the waiting loop, where the timespec argument specifies the maximum time it can wait for, add support for the absolute mode, with the value carrying a CLOCK_MONOTONIC absolute time until which we should return control back to the user. Suggested-by: Lewis Baker Signed-off-by: Pavel Begunkov Link: https://lore.kernel.org/r/4d5b74d67ada882590b2e42aa3aa7117bbf6b55f.1723039801.git.asml.silence@gmail.com Signed-off-by: Jens Axboe --- include/uapi/linux/io_uring.h | 1 + io_uring/io_uring.c | 15 ++++++++------- 2 files changed, 9 insertions(+), 7 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index adc2524fd8e3..6a81f55fcd0d 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -507,6 +507,7 @@ struct io_cqring_offsets { #define IORING_ENTER_SQ_WAIT (1U << 2) #define IORING_ENTER_EXT_ARG (1U << 3) #define IORING_ENTER_REGISTERED_RING (1U << 4) +#define IORING_ENTER_ABS_TIMER (1U << 5) /* * Passed in for io_uring_setup(2). Copied back with updated info on success diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 9ec07f76ad19..5282f9887440 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -2387,7 +2387,7 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx, * Wait until events become available, if we don't already have some. The * application must reap them itself, as they reside on the shared cq ring. */ -static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, +static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags, const sigset_t __user *sig, size_t sigsz, struct __kernel_timespec __user *uts) { @@ -2416,13 +2416,13 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, if (uts) { struct timespec64 ts; - ktime_t dt; if (get_timespec64(&ts, uts)) return -EFAULT; - dt = timespec64_to_ktime(ts); - iowq.timeout = ktime_add(dt, ktime_get()); + iowq.timeout = timespec64_to_ktime(ts); + if (!(flags & IORING_ENTER_ABS_TIMER)) + iowq.timeout = ktime_add(iowq.timeout, ktime_get()); } if (sig) { @@ -3153,7 +3153,8 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, if (unlikely(flags & ~(IORING_ENTER_GETEVENTS | IORING_ENTER_SQ_WAKEUP | IORING_ENTER_SQ_WAIT | IORING_ENTER_EXT_ARG | - IORING_ENTER_REGISTERED_RING))) + IORING_ENTER_REGISTERED_RING | + IORING_ENTER_ABS_TIMER))) return -EINVAL; /* @@ -3251,8 +3252,8 @@ iopoll_locked: if (likely(!ret2)) { min_complete = min(min_complete, ctx->cq_entries); - ret2 = io_cqring_wait(ctx, min_complete, sig, - argsz, ts); + ret2 = io_cqring_wait(ctx, min_complete, flags, + sig, argsz, ts); } } -- cgit v1.2.3 From 2b8e976b984278edbeab3251d370e76d237699f9 Mon Sep 17 00:00:00 2001 From: Pavel Begunkov Date: Wed, 7 Aug 2024 15:18:14 +0100 Subject: io_uring: user registered clockid for wait timeouts Add a new registration opcode IORING_REGISTER_CLOCK, which allows the user to select which clock id it wants to use with CQ waiting timeouts. It only allows a subset of all posix clocks and currently supports CLOCK_MONOTONIC and CLOCK_BOOTTIME. 
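As an illustration, a ring can be switched to CLOCK_BOOTTIME and then waited on with an absolute deadline via the IORING_ENTER_ABS_TIMER flag added in the previous commit. The sketch below uses raw syscalls since liburing wrappers may lag behind; ring setup, error handling and the function name are assumptions:

#include <linux/io_uring.h>
#include <linux/time_types.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Make CQ wait timeouts tick on CLOCK_BOOTTIME, then wait for one CQE
 * with an absolute deadline one second from now. */
static int wait_one_cqe_boottime(int ring_fd)
{
	struct io_uring_clock_register creg = { .clockid = CLOCK_BOOTTIME };
	struct io_uring_getevents_arg arg = {};
	struct __kernel_timespec deadline;
	struct timespec now;

	/* nr_args must be 0 for IORING_REGISTER_CLOCK */
	if (syscall(__NR_io_uring_register, ring_fd, IORING_REGISTER_CLOCK,
		    &creg, 0))
		return -1;

	clock_gettime(CLOCK_BOOTTIME, &now);
	deadline.tv_sec = now.tv_sec + 1;
	deadline.tv_nsec = now.tv_nsec;
	arg.ts = (uint64_t)(uintptr_t)&deadline;

	return syscall(__NR_io_uring_enter, ring_fd, 0, 1,
		       IORING_ENTER_GETEVENTS | IORING_ENTER_EXT_ARG |
		       IORING_ENTER_ABS_TIMER, &arg, sizeof(arg));
}

Because the deadline is absolute, a caller woken early can retry with the same timespec and will not accumulate drift the way chained relative timeouts can.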
Suggested-by: Lewis Baker Signed-off-by: Pavel Begunkov Link: https://lore.kernel.org/r/98f2bc8a3c36cdf8f0e6a275245e81e903459703.1723039801.git.asml.silence@gmail.com Signed-off-by: Jens Axboe --- include/linux/io_uring_types.h | 3 +++ include/uapi/linux/io_uring.h | 7 +++++++ io_uring/io_uring.c | 8 ++++++-- io_uring/io_uring.h | 8 ++++++++ io_uring/napi.c | 2 +- io_uring/register.c | 31 +++++++++++++++++++++++++++++++ 6 files changed, 56 insertions(+), 3 deletions(-) (limited to 'include/uapi') diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h index 3315005df117..4b9ba523978d 100644 --- a/include/linux/io_uring_types.h +++ b/include/linux/io_uring_types.h @@ -239,6 +239,9 @@ struct io_ring_ctx { struct io_rings *rings; struct percpu_ref refs; + clockid_t clockid; + enum tk_offsets clock_offset; + enum task_work_notify_mode notify_method; unsigned sq_thread_idle; } ____cacheline_aligned_in_smp; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 6a81f55fcd0d..7af716136df9 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -596,6 +596,8 @@ enum io_uring_register_op { IORING_REGISTER_NAPI = 27, IORING_UNREGISTER_NAPI = 28, + IORING_REGISTER_CLOCK = 29, + /* this goes last */ IORING_REGISTER_LAST, @@ -676,6 +678,11 @@ struct io_uring_restriction { __u32 resv2[3]; }; +struct io_uring_clock_register { + __u32 clockid; + __u32 __resv[3]; +}; + struct io_uring_buf { __u64 addr; __u32 len; diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 5282f9887440..20229e72b65c 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -2377,7 +2377,8 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx, ret = 0; if (iowq->timeout == KTIME_MAX) schedule(); - else if (!schedule_hrtimeout(&iowq->timeout, HRTIMER_MODE_ABS)) + else if (!schedule_hrtimeout_range_clock(&iowq->timeout, 0, + HRTIMER_MODE_ABS, ctx->clockid)) ret = -ETIME; current->in_iowait = 0; return ret; @@ -2422,7 +2423,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags, iowq.timeout = timespec64_to_ktime(ts); if (!(flags & IORING_ENTER_ABS_TIMER)) - iowq.timeout = ktime_add(iowq.timeout, ktime_get()); + iowq.timeout = ktime_add(iowq.timeout, io_get_time(ctx)); } if (sig) { @@ -3424,6 +3425,9 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p, if (!ctx) return -ENOMEM; + ctx->clockid = CLOCK_MONOTONIC; + ctx->clock_offset = 0; + if ((ctx->flags & IORING_SETUP_DEFER_TASKRUN) && !(ctx->flags & IORING_SETUP_IOPOLL) && !(ctx->flags & IORING_SETUP_SQPOLL)) diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h index c2acf6180845..9935819f12b7 100644 --- a/io_uring/io_uring.h +++ b/io_uring/io_uring.h @@ -437,6 +437,14 @@ static inline bool io_file_can_poll(struct io_kiocb *req) return false; } +static inline ktime_t io_get_time(struct io_ring_ctx *ctx) +{ + if (ctx->clockid == CLOCK_MONOTONIC) + return ktime_get(); + + return ktime_get_with_offset(ctx->clock_offset); +} + enum { IO_CHECK_CQ_OVERFLOW_BIT, IO_CHECK_CQ_DROPPED_BIT, diff --git a/io_uring/napi.c b/io_uring/napi.c index d78fcbecdd27..d0cf694d0172 100644 --- a/io_uring/napi.c +++ b/io_uring/napi.c @@ -283,7 +283,7 @@ void __io_napi_busy_loop(struct io_ring_ctx *ctx, struct io_wait_queue *iowq) iowq->napi_busy_poll_dt = READ_ONCE(ctx->napi_busy_poll_dt); if (iowq->timeout != KTIME_MAX) { - ktime_t dt = ktime_sub(iowq->timeout, ktime_get()); + ktime_t dt = ktime_sub(iowq->timeout, io_get_time(ctx)); 
iowq->napi_busy_poll_dt = min_t(u64, iowq->napi_busy_poll_dt, dt); } diff --git a/io_uring/register.c b/io_uring/register.c index e3c20be5a198..57cb85c42526 100644 --- a/io_uring/register.c +++ b/io_uring/register.c @@ -335,6 +335,31 @@ err: return ret; } +static int io_register_clock(struct io_ring_ctx *ctx, + struct io_uring_clock_register __user *arg) +{ + struct io_uring_clock_register reg; + + if (copy_from_user(®, arg, sizeof(reg))) + return -EFAULT; + if (memchr_inv(®.__resv, 0, sizeof(reg.__resv))) + return -EINVAL; + + switch (reg.clockid) { + case CLOCK_MONOTONIC: + ctx->clock_offset = 0; + break; + case CLOCK_BOOTTIME: + ctx->clock_offset = TK_OFFS_BOOT; + break; + default: + return -EINVAL; + } + + ctx->clockid = reg.clockid; + return 0; +} + static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, void __user *arg, unsigned nr_args) __releases(ctx->uring_lock) @@ -511,6 +536,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, break; ret = io_unregister_napi(ctx, arg); break; + case IORING_REGISTER_CLOCK: + ret = -EINVAL; + if (!arg || nr_args) + break; + ret = io_register_clock(ctx, arg); + break; default: ret = -EINVAL; break; -- cgit v1.2.3 From 7ed9e09e2d13d5d43385153bba4734cb0eafd7fd Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Thu, 4 Jan 2024 10:46:30 -0700 Subject: io_uring: wire up min batch wake timeout Expose min_wait_usec in io_uring_getevents_arg, replacing the pad member that is currently in there. The value is in usecs, which is explained in the name as well. Note that if min_wait_usec and a normal timeout is used in conjunction, the normal timeout is still relative to the base time. For example, if min_wait_usec is set to 100 and the normal timeout is 1000, the max total time waited is still 1000. This also means that if the normal timeout is shorter than min_wait_usec, then only the min_wait_usec will take effect. See previous commit for an explanation of how this works. IORING_FEAT_MIN_TIMEOUT is added as a feature flag for this, as applications doing submit_and_wait_timeout() style operations will generally not see the -EINVAL from the wait side as they return the number of IOs submitted. Only if no IOs are submitted will the -EINVAL bubble back up to the application. 
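For instance, a batched wait could be issued as below (a sketch using raw syscalls; wait_batch and the 1 msec cap are illustrative, and the uapi headers must carry the renamed field):

#include <linux/io_uring.h>
#include <linux/time_types.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Prefer a batch of @nr CQEs for the first @min_usec of the wait; once
 * that window expires, accept whatever has arrived, bounded by a 1 msec
 * total timeout. */
static int wait_batch(int ring_fd, unsigned int nr, unsigned int min_usec)
{
	struct __kernel_timespec ts = { .tv_nsec = 1000 * 1000 };
	struct io_uring_getevents_arg arg = {
		.min_wait_usec	= min_usec,
		.ts		= (uint64_t)(uintptr_t)&ts,
	};

	return syscall(__NR_io_uring_enter, ring_fd, 0, nr,
		       IORING_ENTER_GETEVENTS | IORING_ENTER_EXT_ARG,
		       &arg, sizeof(arg));
}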
Signed-off-by: Jens Axboe --- include/uapi/linux/io_uring.h | 3 ++- io_uring/io_uring.c | 8 ++++---- 2 files changed, 6 insertions(+), 5 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 7af716136df9..042eab793e26 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -543,6 +543,7 @@ struct io_uring_params { #define IORING_FEAT_LINKED_FILE (1U << 12) #define IORING_FEAT_REG_REG_RING (1U << 13) #define IORING_FEAT_RECVSEND_BUNDLE (1U << 14) +#define IORING_FEAT_MIN_TIMEOUT (1U << 15) /* * io_uring_register(2) opcodes and arguments @@ -766,7 +767,7 @@ enum io_uring_register_restriction_op { struct io_uring_getevents_arg { __u64 sigmask; __u32 sigmask_sz; - __u32 pad; + __u32 min_wait_usec; __u64 ts; }; diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 968cd5fb3f79..80bb6e2374e9 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -2475,6 +2475,7 @@ struct ext_arg { size_t argsz; struct __kernel_timespec __user *ts; const sigset_t __user *sig; + ktime_t min_time; }; /* @@ -2508,7 +2509,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags, iowq.cq_min_tail = READ_ONCE(ctx->rings->cq.tail); iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts); iowq.hit_timeout = 0; - iowq.min_timeout = 0; + iowq.min_timeout = ext_arg->min_time; iowq.timeout = KTIME_MAX; start_time = io_get_time(ctx); @@ -3239,8 +3240,7 @@ static int io_get_ext_arg(unsigned flags, const void __user *argp, return -EINVAL; if (copy_from_user(&arg, argp, sizeof(arg))) return -EFAULT; - if (arg.pad) - return -EINVAL; + ext_arg->min_time = arg.min_wait_usec * NSEC_PER_USEC; ext_arg->sig = u64_to_user_ptr(arg.sigmask); ext_arg->argsz = arg.sigmask_sz; ext_arg->ts = u64_to_user_ptr(arg.ts); @@ -3641,7 +3641,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p, IORING_FEAT_EXT_ARG | IORING_FEAT_NATIVE_WORKERS | IORING_FEAT_RSRC_TAGS | IORING_FEAT_CQE_SKIP | IORING_FEAT_LINKED_FILE | IORING_FEAT_REG_REG_RING | - IORING_FEAT_RECVSEND_BUNDLE; + IORING_FEAT_RECVSEND_BUNDLE | IORING_FEAT_MIN_TIMEOUT; if (copy_to_user(params, p, sizeof(*p))) { ret = -EFAULT; -- cgit v1.2.3 From 1d4684fbe88dc28e2bf79f5e94a432f0469d2dac Mon Sep 17 00:00:00 2001 From: Nicolin Chen Date: Fri, 2 Aug 2024 17:32:02 -0700 Subject: iommufd: Reorder include files Reorder include files to alphabetic order to simplify maintenance, and separate local headers and global headers with a blank line. No functional change intended. 
Link: https://patch.msgid.link/r/7524b037cc05afe19db3c18f863253e1d1554fa2.1722644866.git.nicolinc@nvidia.com Signed-off-by: Nicolin Chen Signed-off-by: Jason Gunthorpe --- drivers/iommu/iommufd/device.c | 4 ++-- drivers/iommu/iommufd/fault.c | 4 ++-- drivers/iommu/iommufd/io_pagetable.c | 8 ++++---- drivers/iommu/iommufd/io_pagetable.h | 2 +- drivers/iommu/iommufd/ioas.c | 2 +- drivers/iommu/iommufd/iommufd_private.h | 9 +++++---- drivers/iommu/iommufd/iommufd_test.h | 2 +- drivers/iommu/iommufd/iova_bitmap.c | 2 +- drivers/iommu/iommufd/main.c | 8 ++++---- drivers/iommu/iommufd/pages.c | 10 +++++----- drivers/iommu/iommufd/selftest.c | 9 +++++---- include/linux/iommufd.h | 4 ++-- include/uapi/linux/iommufd.h | 2 +- 13 files changed, 34 insertions(+), 32 deletions(-) (limited to 'include/uapi') diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index 9a7ec5997c61..11573a84f68a 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -1,12 +1,12 @@ // SPDX-License-Identifier: GPL-2.0-only /* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES */ +#include #include #include -#include #include -#include "../iommu-priv.h" +#include "../iommu-priv.h" #include "io_pagetable.h" #include "iommufd_private.h" diff --git a/drivers/iommu/iommufd/fault.c b/drivers/iommu/iommufd/fault.c index a643d5c7c535..df03411c8728 100644 --- a/drivers/iommu/iommufd/fault.c +++ b/drivers/iommu/iommufd/fault.c @@ -3,14 +3,14 @@ */ #define pr_fmt(fmt) "iommufd: " fmt +#include #include #include +#include #include #include -#include #include #include -#include #include #include "../iommu-priv.h" diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c index 05fd9d3abf1b..bbbc8a044bcf 100644 --- a/drivers/iommu/iommufd/io_pagetable.c +++ b/drivers/iommu/iommufd/io_pagetable.c @@ -8,17 +8,17 @@ * The datastructure uses the iopt_pages to optimize the storage of the PFNs * between the domains and xarray. 
*/ +#include +#include +#include #include #include -#include #include -#include #include -#include #include -#include "io_pagetable.h" #include "double_span.h" +#include "io_pagetable.h" struct iopt_pages_list { struct iopt_pages *pages; diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h index 0ec3509b7e33..c61d74471684 100644 --- a/drivers/iommu/iommufd/io_pagetable.h +++ b/drivers/iommu/iommufd/io_pagetable.h @@ -6,8 +6,8 @@ #define __IO_PAGETABLE_H #include -#include #include +#include #include #include "iommufd_private.h" diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c index 742248276548..82428e44a837 100644 --- a/drivers/iommu/iommufd/ioas.c +++ b/drivers/iommu/iommufd/ioas.c @@ -3,8 +3,8 @@ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES */ #include -#include #include +#include #include #include "io_pagetable.h" diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 92efe30a8f0d..017e50574f3b 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -4,13 +4,14 @@ #ifndef __IOMMUFD_PRIVATE_H #define __IOMMUFD_PRIVATE_H -#include -#include -#include -#include #include #include +#include +#include +#include +#include #include + #include "../iommu-priv.h" struct iommu_domain; diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h index acbbba1c6671..f4bc23a92f9a 100644 --- a/drivers/iommu/iommufd/iommufd_test.h +++ b/drivers/iommu/iommufd/iommufd_test.h @@ -4,8 +4,8 @@ #ifndef _UAPI_IOMMUFD_TEST_H #define _UAPI_IOMMUFD_TEST_H -#include #include +#include enum { IOMMU_TEST_OP_ADD_RESERVED = 1, diff --git a/drivers/iommu/iommufd/iova_bitmap.c b/drivers/iommu/iommufd/iova_bitmap.c index b9e964b1ad5c..d90b9e253412 100644 --- a/drivers/iommu/iommufd/iova_bitmap.c +++ b/drivers/iommu/iommufd/iova_bitmap.c @@ -3,10 +3,10 @@ * Copyright (c) 2022, Oracle and/or its affiliates. * Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved */ +#include #include #include #include -#include #define BITS_PER_PAGE (PAGE_SIZE * BITS_PER_BYTE) diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 83bbd7c5d160..b5f5d27ee963 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -8,15 +8,15 @@ */ #define pr_fmt(fmt) "iommufd: " fmt +#include #include #include -#include -#include +#include #include +#include #include -#include +#include #include -#include #include "io_pagetable.h" #include "iommufd_private.h" diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c index 117f644a0c5b..93d806c9c073 100644 --- a/drivers/iommu/iommufd/pages.c +++ b/drivers/iommu/iommufd/pages.c @@ -45,16 +45,16 @@ * last_iova + 1 can overflow. An iopt_pages index will always be much less than * ULONG_MAX so last_index + 1 cannot overflow. 
*/ +#include +#include +#include +#include #include #include -#include #include -#include -#include -#include -#include "io_pagetable.h" #include "double_span.h" +#include "io_pagetable.h" #ifndef CONFIG_IOMMUFD_TEST #define TEMP_MEMORY_LIMIT 65536 diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index f95e32e29133..04293b20e20c 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -3,13 +3,14 @@ * * Kernel side components to support tools/testing/selftests/iommu */ -#include -#include -#include -#include #include +#include #include +#include +#include #include +#include +#include #include #include "../iommu-priv.h" diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index ffc3a949f837..c2f2f6b9148e 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -6,9 +6,9 @@ #ifndef __LINUX_IOMMUFD_H #define __LINUX_IOMMUFD_H -#include -#include #include +#include +#include struct device; struct iommufd_device; diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index 4dde745cfb7e..72010f71c5e4 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -4,8 +4,8 @@ #ifndef _UAPI_IOMMUFD_H #define _UAPI_IOMMUFD_H -#include #include +#include #define IOMMUFD_TYPE (';') -- cgit v1.2.3 From abcd3026dd63417692a5e80aff70e7cd9b5c14ea Mon Sep 17 00:00:00 2001 From: Oleksij Rempel Date: Thu, 22 Aug 2024 14:07:01 +0200 Subject: ethtool: Extend cable testing interface with result source information Extend the ethtool netlink cable testing interface by adding support for specifying the source of cable testing results. This allows users to differentiate between results obtained through different diagnostic methods. For example, some TI 10BaseT1L PHYs provide two variants of cable diagnostics: Time Domain Reflectometry (TDR) and Active Link Cable Diagnostic (ALCD). By introducing `ETHTOOL_A_CABLE_RESULT_SRC` and `ETHTOOL_A_CABLE_FAULT_LENGTH_SRC` attributes, this update enables drivers to indicate whether the result was derived from TDR or ALCD, improving the clarity and utility of diagnostic information. Signed-off-by: Oleksij Rempel Reviewed-by: Andrew Lunn Link: https://patch.msgid.link/20240822120703.1393130-2-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski --- Documentation/netlink/specs/ethtool.yaml | 6 ++++++ Documentation/networking/ethtool-netlink.rst | 5 +++++ include/uapi/linux/ethtool_netlink.h | 11 +++++++++++ 3 files changed, 22 insertions(+) (limited to 'include/uapi') diff --git a/Documentation/netlink/specs/ethtool.yaml b/Documentation/netlink/specs/ethtool.yaml index 37211b1e861f..6a050d755b9c 100644 --- a/Documentation/netlink/specs/ethtool.yaml +++ b/Documentation/netlink/specs/ethtool.yaml @@ -667,6 +667,9 @@ attribute-sets: - name: code type: u8 + - + name: src + type: u32 - name: cable-fault-length attributes: @@ -676,6 +679,9 @@ attribute-sets: - name: cm type: u32 + - + name: src + type: u32 - name: cable-nest attributes: diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst index 44cb9e29f325..ba90457b8b2d 100644 --- a/Documentation/networking/ethtool-netlink.rst +++ b/Documentation/networking/ethtool-netlink.rst @@ -1314,12 +1314,17 @@ information. 
+-+-+-----------------------------------------+--------+---------------------+ | | | ``ETHTOOL_A_CABLE_RESULTS_CODE`` | u8 | result code | +-+-+-----------------------------------------+--------+---------------------+ + | | | ``ETHTOOL_A_CABLE_RESULT_SRC`` | u32 | information source | + +-+-+-----------------------------------------+--------+---------------------+ | | ``ETHTOOL_A_CABLE_NEST_FAULT_LENGTH`` | nested | cable length | +-+-+-----------------------------------------+--------+---------------------+ | | | ``ETHTOOL_A_CABLE_FAULT_LENGTH_PAIR`` | u8 | pair number | +-+-+-----------------------------------------+--------+---------------------+ | | | ``ETHTOOL_A_CABLE_FAULT_LENGTH_CM`` | u32 | length in cm | +-+-+-----------------------------------------+--------+---------------------+ + | | | ``ETHTOOL_A_CABLE_FAULT_LENGTH_SRC`` | u32 | information source | + +-+-+-----------------------------------------+--------+---------------------+ + CABLE_TEST TDR ============== diff --git a/include/uapi/linux/ethtool_netlink.h b/include/uapi/linux/ethtool_netlink.h index 45d8bcdea056..283305f6b063 100644 --- a/include/uapi/linux/ethtool_netlink.h +++ b/include/uapi/linux/ethtool_netlink.h @@ -573,10 +573,20 @@ enum { ETHTOOL_A_CABLE_PAIR_D, }; +/* Information source for specific results. */ +enum { + ETHTOOL_A_CABLE_INF_SRC_UNSPEC, + /* Results provided by the Time Domain Reflectometry (TDR) */ + ETHTOOL_A_CABLE_INF_SRC_TDR, + /* Results provided by the Active Link Cable Diagnostic (ALCD) */ + ETHTOOL_A_CABLE_INF_SRC_ALCD, +}; + enum { ETHTOOL_A_CABLE_RESULT_UNSPEC, ETHTOOL_A_CABLE_RESULT_PAIR, /* u8 ETHTOOL_A_CABLE_PAIR_ */ ETHTOOL_A_CABLE_RESULT_CODE, /* u8 ETHTOOL_A_CABLE_RESULT_CODE_ */ + ETHTOOL_A_CABLE_RESULT_SRC, /* u32 ETHTOOL_A_CABLE_INF_SRC_ */ __ETHTOOL_A_CABLE_RESULT_CNT, ETHTOOL_A_CABLE_RESULT_MAX = (__ETHTOOL_A_CABLE_RESULT_CNT - 1) @@ -586,6 +596,7 @@ enum { ETHTOOL_A_CABLE_FAULT_LENGTH_UNSPEC, ETHTOOL_A_CABLE_FAULT_LENGTH_PAIR, /* u8 ETHTOOL_A_CABLE_PAIR_ */ ETHTOOL_A_CABLE_FAULT_LENGTH_CM, /* u32 */ + ETHTOOL_A_CABLE_FAULT_LENGTH_SRC, /* u32 ETHTOOL_A_CABLE_INF_SRC_ */ __ETHTOOL_A_CABLE_FAULT_LENGTH_CNT, ETHTOOL_A_CABLE_FAULT_LENGTH_MAX = (__ETHTOOL_A_CABLE_FAULT_LENGTH_CNT - 1) -- cgit v1.2.3 From d24dac8eb8111c401b0de40a17760a0a254bcffc Mon Sep 17 00:00:00 2001 From: Simon Horman Date: Thu, 22 Aug 2024 13:57:22 +0100 Subject: packet: Correct spelling in if_packet.h Correct spelling in if_packet.h As reported by codespell. Signed-off-by: Simon Horman Acked-by: Willem de Bruijn Link: https://patch.msgid.link/20240822-net-spell-v1-1-3a98971ce2d2@kernel.org Signed-off-by: Jakub Kicinski --- include/uapi/linux/if_packet.h | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h index 9efc42382fdb..1d2718dd9647 100644 --- a/include/uapi/linux/if_packet.h +++ b/include/uapi/linux/if_packet.h @@ -230,8 +230,8 @@ struct tpacket_hdr_v1 { * ts_first_pkt: * Is always the time-stamp when the block was opened. * Case a) ZERO packets - * No packets to deal with but atleast you know the - * time-interval of this block. + * No packets to deal with but at least you know + * the time-interval of this block. * Case b) Non-zero packets * Use the ts of the first packet in the block. 
* @@ -265,7 +265,8 @@ enum tpacket_versions { - struct tpacket_hdr - pad to TPACKET_ALIGNMENT=16 - struct sockaddr_ll - - Gap, chosen so that packet data (Start+tp_net) alignes to TPACKET_ALIGNMENT=16 + - Gap, chosen so that packet data (Start+tp_net) aligns to + TPACKET_ALIGNMENT=16 - Start+tp_mac: [ Optional MAC header ] - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16. - Pad to align to TPACKET_ALIGNMENT=16 -- cgit v1.2.3 From 70d0bb45fae87a3b08970a318e15f317446a1956 Mon Sep 17 00:00:00 2001 From: Simon Horman Date: Thu, 22 Aug 2024 13:57:33 +0100 Subject: net: Correct spelling in headers Correct spelling in Networking headers. As reported by codespell. Signed-off-by: Simon Horman Link: https://patch.msgid.link/20240822-net-spell-v1-12-3a98971ce2d2@kernel.org Signed-off-by: Jakub Kicinski --- include/linux/etherdevice.h | 2 +- include/linux/netdevice.h | 8 ++++---- include/net/addrconf.h | 2 +- include/net/busy_poll.h | 2 +- include/net/caif/caif_layer.h | 4 ++-- include/net/caif/cfpkt.h | 2 +- include/net/dropreason-core.h | 6 +++--- include/net/dst.h | 2 +- include/net/dst_cache.h | 2 +- include/net/erspan.h | 4 ++-- include/net/hwbm.h | 4 ++-- include/net/llc_pdu.h | 2 +- include/net/netlink.h | 16 ++++++++-------- include/net/netns/sctp.h | 4 ++-- include/net/regulatory.h | 2 +- include/net/sock.h | 4 ++-- include/net/udp.h | 2 +- include/uapi/linux/in.h | 2 +- include/uapi/linux/inet_diag.h | 2 +- 19 files changed, 36 insertions(+), 36 deletions(-) (limited to 'include/uapi') diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h index 0ed47d00549b..30114c25ad12 100644 --- a/include/linux/etherdevice.h +++ b/include/linux/etherdevice.h @@ -645,7 +645,7 @@ static inline struct ethhdr *eth_skb_pull_mac(struct sk_buff *skb) } /** - * eth_skb_pad - Pad buffer to mininum number of octets for Ethernet frame + * eth_skb_pad - Pad buffer to minimum number of octets for Ethernet frame * @skb: Buffer to pad * * An Ethernet frame should have a minimum size of 60 bytes. This function diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 7dfa836d9dbf..fce70990b209 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1237,7 +1237,7 @@ struct netdev_net_notifier { * int (*ndo_fdb_del)(struct ndmsg *ndm, struct nlattr *tb[], * struct net_device *dev, * const unsigned char *addr, u16 vid) - * Deletes the FDB entry from dev coresponding to addr. + * Deletes the FDB entry from dev corresponding to addr. 
* int (*ndo_fdb_del_bulk)(struct nlmsghdr *nlh, struct net_device *dev, * struct netlink_ext_ack *extack); * int (*ndo_fdb_dump)(struct sk_buff *skb, struct netlink_callback *cb, @@ -3514,7 +3514,7 @@ static inline void netdev_tx_completed_queue(struct netdev_queue *dev_queue, dql_completed(&dev_queue->dql, bytes); /* - * Without the memory barrier there is a small possiblity that + * Without the memory barrier there is a small possibility that * netdev_tx_sent_queue will miss the update and cause the queue to * be stopped forever */ @@ -4583,7 +4583,7 @@ void dev_uc_flush(struct net_device *dev); void dev_uc_init(struct net_device *dev); /** - * __dev_uc_sync - Synchonize device's unicast list + * __dev_uc_sync - Synchronize device's unicast list * @dev: device to sync * @sync: function to call if address should be added * @unsync: function to call if address should be removed @@ -4627,7 +4627,7 @@ void dev_mc_flush(struct net_device *dev); void dev_mc_init(struct net_device *dev); /** - * __dev_mc_sync - Synchonize device's multicast list + * __dev_mc_sync - Synchronize device's multicast list * @dev: device to sync * @sync: function to call if address should be added * @unsync: function to call if address should be removed diff --git a/include/net/addrconf.h b/include/net/addrconf.h index c8ed31828db3..363dd63babe7 100644 --- a/include/net/addrconf.h +++ b/include/net/addrconf.h @@ -333,7 +333,7 @@ static inline struct inet6_dev *__in6_dev_get(const struct net_device *dev) /** * __in6_dev_stats_get - get inet6_dev pointer for stats * @dev: network device - * @skb: skb for original incoming interface if neeeded + * @skb: skb for original incoming interface if needed * * Caller must hold rcu_read_lock or RTNL, because this function * does not take a reference on the inet6_dev. diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h index 9b09acac538e..5e3b5703e4ab 100644 --- a/include/net/busy_poll.h +++ b/include/net/busy_poll.h @@ -131,7 +131,7 @@ static inline void skb_mark_napi_id(struct sk_buff *skb, #endif } -/* used in the protocol hanlder to propagate the napi_id to the socket */ +/* used in the protocol handler to propagate the napi_id to the socket */ static inline void sk_mark_napi_id(struct sock *sk, const struct sk_buff *skb) { #ifdef CONFIG_NET_RX_BUSY_POLL diff --git a/include/net/caif/caif_layer.h b/include/net/caif/caif_layer.h index 0f45d875905f..053e7c6a6a66 100644 --- a/include/net/caif/caif_layer.h +++ b/include/net/caif/caif_layer.h @@ -20,7 +20,7 @@ struct caif_payload_info; * @assert: expression to evaluate. * * This function will print a error message and a do WARN_ON if the - * assertion failes. Normally this will do a stack up at the current location. + * assertion fails. Normally this will do a stack up at the current location. */ #define caif_assert(assert) \ do { \ @@ -116,7 +116,7 @@ enum caif_direction { * @dn: Pointer down to the layer below. * @node: List node used when layer participate in a list. * @receive: Packet receive function. - * @transmit: Packet transmit funciton. + * @transmit: Packet transmit function. * @ctrlcmd: Used for control signalling upwards in the stack. * @modemcmd: Used for control signaling downwards in the stack. * @id: The identity of this layer diff --git a/include/net/caif/cfpkt.h b/include/net/caif/cfpkt.h index 44d914a50369..acf664227d96 100644 --- a/include/net/caif/cfpkt.h +++ b/include/net/caif/cfpkt.h @@ -18,7 +18,7 @@ struct cfpkt *cfpkt_create(u16 len); /* * Destroy a CAIF Packet. - * pkt Packet to be destoyed. 
+ * pkt Packet to be destroyed. */ void cfpkt_destroy(struct cfpkt *pkt); diff --git a/include/net/dropreason-core.h b/include/net/dropreason-core.h index 9707ab54fdd5..4748680e8c88 100644 --- a/include/net/dropreason-core.h +++ b/include/net/dropreason-core.h @@ -155,8 +155,8 @@ enum skb_drop_reason { /** @SKB_DROP_REASON_SOCKET_RCVBUFF: socket receive buff is full */ SKB_DROP_REASON_SOCKET_RCVBUFF, /** - * @SKB_DROP_REASON_PROTO_MEM: proto memory limition, such as udp packet - * drop out of udp_memory_allocated. + * @SKB_DROP_REASON_PROTO_MEM: proto memory limitation, such as + * udp packet drop out of udp_memory_allocated. */ SKB_DROP_REASON_PROTO_MEM, /** @@ -217,7 +217,7 @@ enum skb_drop_reason { */ SKB_DROP_REASON_TCP_ZEROWINDOW, /** - * @SKB_DROP_REASON_TCP_OLD_DATA: the TCP data reveived is already + * @SKB_DROP_REASON_TCP_OLD_DATA: the TCP data received is already * received before (spurious retrans may happened), see * LINUX_MIB_DELAYEDACKLOST */ diff --git a/include/net/dst.h b/include/net/dst.h index 0aa331bd2fdb..0f303cc60252 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -341,7 +341,7 @@ static inline void __skb_tunnel_rx(struct sk_buff *skb, struct net_device *dev, skb->dev = dev; /* - * Clear hash so that we can recalulate the hash for the + * Clear hash so that we can recalculate the hash for the * encapsulated packet, unless we have already determine the hash * over the L4 4-tuple. */ diff --git a/include/net/dst_cache.h b/include/net/dst_cache.h index b4a55d2d5e71..1961699598e2 100644 --- a/include/net/dst_cache.h +++ b/include/net/dst_cache.h @@ -102,7 +102,7 @@ int dst_cache_init(struct dst_cache *dst_cache, gfp_t gfp); * @dst_cache: the cache * * No synchronization is enforced: it must be called only when the cache - * is unsed. + * is unused. */ void dst_cache_destroy(struct dst_cache *dst_cache); diff --git a/include/net/erspan.h b/include/net/erspan.h index 6cb4cbd6a48f..c6209e7b6c96 100644 --- a/include/net/erspan.h +++ b/include/net/erspan.h @@ -89,7 +89,7 @@ enum erspan_encap_type { ERSPAN_ENCAP_NOVLAN = 0x0, /* originally without VLAN tag */ ERSPAN_ENCAP_ISL = 0x1, /* originally ISL encapsulated */ ERSPAN_ENCAP_8021Q = 0x2, /* originally 802.1Q encapsulated */ - ERSPAN_ENCAP_INFRAME = 0x3, /* VLAN tag perserved in frame */ + ERSPAN_ENCAP_INFRAME = 0x3, /* VLAN tag preserved in frame */ }; #define ERSPAN_V1_MDSIZE 4 @@ -192,7 +192,7 @@ static inline void erspan_build_header(struct sk_buff *skb, enc_type = ERSPAN_ENCAP_NOVLAN; /* If mirrored packet has vlan tag, extract tci and - * perserve vlan header in the mirrored frame. + * preserve vlan header in the mirrored frame. 
 */
	if (eth->h_proto == htons(ETH_P_8021Q)) {
		qp = (struct qtag_prefix *)(skb->data + 2 * ETH_ALEN);
diff --git a/include/net/hwbm.h b/include/net/hwbm.h
index aa495decec35..bdbe91c609ff 100644
--- a/include/net/hwbm.h
+++ b/include/net/hwbm.h
@@ -11,9 +11,9 @@ struct hwbm_pool {
 	int frag_size;
 	/* Number of buffers currently used by this pool */
 	int buf_num;
-	/* constructor called during alocation */
+	/* constructor called during allocation */
 	int (*construct)(struct hwbm_pool *bm_pool, void *buf);
-	/* protect acces to the buffer counter*/
+	/* protect access to the buffer counter*/
 	struct mutex buf_lock;
 	/* private data */
 	void *priv;
diff --git a/include/net/llc_pdu.h b/include/net/llc_pdu.h
index 1d55ba7c45be..86681f29bda7 100644
--- a/include/net/llc_pdu.h
+++ b/include/net/llc_pdu.h
@@ -254,7 +254,7 @@ static inline void llc_pdu_header_init(struct sk_buff *skb, u8 type,
 }

 /**
- * llc_pdu_decode_sa - extracs source address (MAC) of input frame
+ * llc_pdu_decode_sa - extracts source address (MAC) of input frame
  * @skb: input skb that source address must be extracted from it.
  * @sa: pointer to source address (6 byte array).
  *
diff --git a/include/net/netlink.h b/include/net/netlink.h
index e78ce008e07c..db6af207287c 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -827,7 +827,7 @@ nlmsg_parse_deprecated_strict(const struct nlmsghdr *nlh, int hdrlen,
 /**
  * nlmsg_find_attr - find a specific attribute in a netlink message
  * @nlh: netlink message header
- * @hdrlen: length of familiy specific header
+ * @hdrlen: length of family specific header
  * @attrtype: type of attribute to look for
  *
  * Returns the first attribute which matches the specified type.
@@ -849,7 +849,7 @@ static inline struct nlattr *nlmsg_find_attr(const struct nlmsghdr *nlh,
  *
  * Validates all attributes in the specified attribute stream against the
  * specified policy. Validation is done in liberal mode.
- * See documenation of struct nla_policy for more details.
+ * See documentation of struct nla_policy for more details.
  *
  * Returns 0 on success or a negative error code.
  */
@@ -872,7 +872,7 @@ static inline int nla_validate_deprecated(const struct nlattr *head, int len,
  *
  * Validates all attributes in the specified attribute stream against the
  * specified policy. Validation is done in strict mode.
- * See documenation of struct nla_policy for more details.
+ * See documentation of struct nla_policy for more details.
  *
  * Returns 0 on success or a negative error code.
*/ @@ -887,7 +887,7 @@ static inline int nla_validate(const struct nlattr *head, int len, int maxtype, /** * nlmsg_validate_deprecated - validate a netlink message including attributes * @nlh: netlinket message header - * @hdrlen: length of familiy specific header + * @hdrlen: length of family specific header * @maxtype: maximum attribute type to be expected * @policy: validation policy * @extack: extended ACK report struct @@ -933,7 +933,7 @@ static inline u32 nlmsg_seq(const struct nlmsghdr *nlh) * nlmsg_for_each_attr - iterate over a stream of attributes * @pos: loop counter, set to current attribute * @nlh: netlink message header - * @hdrlen: length of familiy specific header + * @hdrlen: length of family specific header * @rem: initialized to len, holds bytes currently remaining in stream */ #define nlmsg_for_each_attr(pos, nlh, hdrlen, rem) \ @@ -1034,7 +1034,7 @@ static inline struct sk_buff *nlmsg_new_large(size_t payload) * @skb: socket buffer the message is stored in * @nlh: netlink message header * - * Corrects the netlink message header to include the appeneded + * Corrects the netlink message header to include the appended * attributes. Only necessary if attributes have been added to * the message. */ @@ -1954,7 +1954,7 @@ static inline struct nlattr *nla_nest_start(struct sk_buff *skb, int attrtype) * @start: container attribute * * Corrects the container attribute header to include the all - * appeneded attributes. + * appended attributes. * * Returns the total data length of the skb. */ @@ -1987,7 +1987,7 @@ static inline void nla_nest_cancel(struct sk_buff *skb, struct nlattr *start) * * Validates all attributes in the nested attribute stream against the * specified policy. Attributes with a type exceeding maxtype will be - * ignored. See documenation of struct nla_policy for more details. + * ignored. See documentation of struct nla_policy for more details. * * Returns 0 on success or a negative error code. */ diff --git a/include/net/netns/sctp.h b/include/net/netns/sctp.h index 7eff3d981b89..d25cd7a9c5ff 100644 --- a/include/net/netns/sctp.h +++ b/include/net/netns/sctp.h @@ -125,14 +125,14 @@ struct netns_sctp { int pf_expose; /* - * Policy for preforming sctp/socket accounting + * Policy for performing sctp/socket accounting * 0 - do socket level accounting, all assocs share sk_sndbuf * 1 - do sctp accounting, each asoc may use sk_sndbuf bytes */ int sndbuf_policy; /* - * Policy for preforming sctp/socket accounting + * Policy for performing sctp/socket accounting * 0 - do socket level accounting, all assocs share sk_rcvbuf * 1 - do sctp accounting, each asoc may use sk_rcvbuf bytes */ diff --git a/include/net/regulatory.h b/include/net/regulatory.h index a103f4c8cf75..6633627f6e76 100644 --- a/include/net/regulatory.h +++ b/include/net/regulatory.h @@ -121,7 +121,7 @@ struct regulatory_request { * @REGULATORY_DISABLE_BEACON_HINTS: enable this if your driver needs to * ensure that passive scan flags and beaconing flags may not be lifted by * cfg80211 due to regulatory beacon hints. 
For more information on beacon - * hints read the documenation for regulatory_hint_found_beacon() + * hints read the documentation for regulatory_hint_found_beacon() * @REGULATORY_COUNTRY_IE_FOLLOW_POWER: for devices that have a preference * that even though they may have programmed their own custom power * setting prior to wiphy registration, they want to ensure their channel diff --git a/include/net/sock.h b/include/net/sock.h index cce23ac4d514..f51d61fab059 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1624,7 +1624,7 @@ bool __lock_sock_fast(struct sock *sk) __acquires(&sk->sk_lock.slock); * lock_sock_fast - fast version of lock_sock * @sk: socket * - * This version should be used for very small section, where process wont block + * This version should be used for very small section, where process won't block * return false if fast path is taken: * * sk_lock.slock locked, owned = 0, BH disabled @@ -2546,7 +2546,7 @@ struct sock_skb_cb { /* Store sock_skb_cb at the end of skb->cb[] so protocol families * using skb->cb[] would keep using it directly and utilize its - * alignement guarantee. + * alignment guarantee. */ #define SOCK_SKB_CB_OFFSET ((sizeof_field(struct sk_buff, cb) - \ sizeof(struct sock_skb_cb))) diff --git a/include/net/udp.h b/include/net/udp.h index 5ca53b1cec67..61222545ab1c 100644 --- a/include/net/udp.h +++ b/include/net/udp.h @@ -232,7 +232,7 @@ static inline __be16 udp_flow_src_port(struct net *net, struct sk_buff *skb, } /* Since this is being sent on the wire obfuscate hash a bit - * to minimize possbility that any useful information to an + * to minimize possibility that any useful information to an * attacker is leaked. Only upper 16 bits are relevant in the * computation for 16 bit port value. */ diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h index d358add1611c..5d32d53508d9 100644 --- a/include/uapi/linux/in.h +++ b/include/uapi/linux/in.h @@ -141,7 +141,7 @@ struct in_addr { */ #define IP_PMTUDISC_INTERFACE 4 /* weaker version of IP_PMTUDISC_INTERFACE, which allows packets to get - * fragmented if they exeed the interface mtu + * fragmented if they exceed the interface mtu */ #define IP_PMTUDISC_OMIT 5 diff --git a/include/uapi/linux/inet_diag.h b/include/uapi/linux/inet_diag.h index 50655de04c9b..86bb2e8b17c9 100644 --- a/include/uapi/linux/inet_diag.h +++ b/include/uapi/linux/inet_diag.h @@ -143,7 +143,7 @@ enum { INET_DIAG_SHUTDOWN, /* - * Next extenstions cannot be requested in struct inet_diag_req_v2: + * Next extensions cannot be requested in struct inet_diag_req_v2: * its field idiag_ext has only 8 bits. */ -- cgit v1.2.3 From cda1fba15cb2282b3c364805c9767698f11c3b0e Mon Sep 17 00:00:00 2001 From: Arkadiusz Kubalewski Date: Fri, 23 Aug 2024 00:25:12 +0200 Subject: dpll: add Embedded SYNC feature for a pin Implement and document new pin attributes for providing Embedded SYNC capabilities to the DPLL subsystem users through a netlink pin-get do/dump messages. Allow the user to set Embedded SYNC frequency with pin-set do netlink message. 
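For illustration, a minimal driver-side sketch of the ops backing these
attributes could look like the following (not part of the patch; the my_*
names, the range table and the 25% pulse value are invented, while the
callback signatures and struct dpll_pin_esync fields are taken from this
series). Note the 0 Hz range entry: a disable request must pass the range
check in dpll_pin_esync_set().

	#include <linux/dpll.h>

	struct my_pin {
		bool esync_on;		/* driver's own device state */
	};

	/* advertise "off" (0 Hz) and a 1 Hz (1PPS) Embedded SYNC */
	static const struct dpll_pin_frequency my_esync_range[] = {
		DPLL_PIN_FREQUENCY_RANGE(0, 0),
		DPLL_PIN_FREQUENCY_RANGE(1, 1),
	};

	static int my_esync_get(const struct dpll_pin *pin, void *pin_priv,
				const struct dpll_device *dpll, void *dpll_priv,
				struct dpll_pin_esync *esync,
				struct netlink_ext_ack *extack)
	{
		struct my_pin *mp = pin_priv;

		esync->freq = mp->esync_on ? 1 : 0;
		esync->range = my_esync_range;
		esync->range_num = ARRAY_SIZE(my_esync_range);
		esync->pulse = 25;	/* high/low pulse ratio, in percent */
		return 0;
	}

	static int my_esync_set(const struct dpll_pin *pin, void *pin_priv,
				const struct dpll_device *dpll, void *dpll_priv,
				u64 freq, struct netlink_ext_ack *extack)
	{
		struct my_pin *mp = pin_priv;

		mp->esync_on = freq != 0;	/* program the hardware here */
		return 0;
	}

	static const struct dpll_pin_ops my_pin_ops = {
		.esync_get = my_esync_get,
		.esync_set = my_esync_set,
	};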
Reviewed-by: Aleksandr Loktionov Signed-off-by: Arkadiusz Kubalewski Reviewed-by: Jiri Pirko Link: https://patch.msgid.link/20240822222513.255179-2-arkadiusz.kubalewski@intel.com Signed-off-by: Jakub Kicinski --- Documentation/driver-api/dpll.rst | 21 ++++++ Documentation/netlink/specs/dpll.yaml | 24 +++++++ drivers/dpll/dpll_netlink.c | 130 ++++++++++++++++++++++++++++++++++ drivers/dpll/dpll_nl.c | 5 +- include/linux/dpll.h | 15 ++++ include/uapi/linux/dpll.h | 3 + 6 files changed, 196 insertions(+), 2 deletions(-) (limited to 'include/uapi') diff --git a/Documentation/driver-api/dpll.rst b/Documentation/driver-api/dpll.rst index ea8d16600e16..e6855cd37e85 100644 --- a/Documentation/driver-api/dpll.rst +++ b/Documentation/driver-api/dpll.rst @@ -214,6 +214,27 @@ offset values are fractional with 3-digit decimal places and shell be divided with ``DPLL_PIN_PHASE_OFFSET_DIVIDER`` to get integer part and modulo divided to get fractional part. +Embedded SYNC +============= + +Device may provide ability to use Embedded SYNC feature. It allows +to embed additional SYNC signal into the base frequency of a pin - a one +special pulse of base frequency signal every time SYNC signal pulse +happens. The user can configure the frequency of Embedded SYNC. +The Embedded SYNC capability is always related to a given base frequency +and HW capabilities. The user is provided a range of Embedded SYNC +frequencies supported, depending on current base frequency configured for +the pin. + + ========================================= ================================= + ``DPLL_A_PIN_ESYNC_FREQUENCY`` current Embedded SYNC frequency + ``DPLL_A_PIN_ESYNC_FREQUENCY_SUPPORTED`` nest available Embedded SYNC + frequency ranges + ``DPLL_A_PIN_FREQUENCY_MIN`` attr minimum value of frequency + ``DPLL_A_PIN_FREQUENCY_MAX`` attr maximum value of frequency + ``DPLL_A_PIN_ESYNC_PULSE`` pulse type of Embedded SYNC + ========================================= ================================= + Configuration commands group ============================ diff --git a/Documentation/netlink/specs/dpll.yaml b/Documentation/netlink/specs/dpll.yaml index 94132d30e0e0..f2894ca35de8 100644 --- a/Documentation/netlink/specs/dpll.yaml +++ b/Documentation/netlink/specs/dpll.yaml @@ -345,6 +345,26 @@ attribute-sets: Value is in PPM (parts per million). This may be implemented for example for pin of type PIN_TYPE_SYNCE_ETH_PORT. + - + name: esync-frequency + type: u64 + doc: | + Frequency of Embedded SYNC signal. If provided, the pin is configured + with a SYNC signal embedded into its base clock frequency. + - + name: esync-frequency-supported + type: nest + multi-attr: true + nested-attributes: frequency-range + doc: | + If provided a pin is capable of embedding a SYNC signal (within given + range) into its base frequency signal. + - + name: esync-pulse + type: u32 + doc: | + A ratio of high to low state of a SYNC signal pulse embedded + into base clock frequency. Value is in percents. 
- name: pin-parent-device subset-of: pin @@ -510,6 +530,9 @@ operations: - phase-adjust-max - phase-adjust - fractional-frequency-offset + - esync-frequency + - esync-frequency-supported + - esync-pulse dump: request: @@ -536,6 +559,7 @@ operations: - parent-device - parent-pin - phase-adjust + - esync-frequency - name: pin-create-ntf doc: Notification about pin appearing diff --git a/drivers/dpll/dpll_netlink.c b/drivers/dpll/dpll_netlink.c index 98e6ad8528d3..fc0280dcddd1 100644 --- a/drivers/dpll/dpll_netlink.c +++ b/drivers/dpll/dpll_netlink.c @@ -342,6 +342,51 @@ dpll_msg_add_pin_freq(struct sk_buff *msg, struct dpll_pin *pin, return 0; } +static int +dpll_msg_add_pin_esync(struct sk_buff *msg, struct dpll_pin *pin, + struct dpll_pin_ref *ref, struct netlink_ext_ack *extack) +{ + const struct dpll_pin_ops *ops = dpll_pin_ops(ref); + struct dpll_device *dpll = ref->dpll; + struct dpll_pin_esync esync; + struct nlattr *nest; + int ret, i; + + if (!ops->esync_get) + return 0; + ret = ops->esync_get(pin, dpll_pin_on_dpll_priv(dpll, pin), dpll, + dpll_priv(dpll), &esync, extack); + if (ret == -EOPNOTSUPP) + return 0; + else if (ret) + return ret; + if (nla_put_64bit(msg, DPLL_A_PIN_ESYNC_FREQUENCY, sizeof(esync.freq), + &esync.freq, DPLL_A_PIN_PAD)) + return -EMSGSIZE; + if (nla_put_u32(msg, DPLL_A_PIN_ESYNC_PULSE, esync.pulse)) + return -EMSGSIZE; + for (i = 0; i < esync.range_num; i++) { + nest = nla_nest_start(msg, + DPLL_A_PIN_ESYNC_FREQUENCY_SUPPORTED); + if (!nest) + return -EMSGSIZE; + if (nla_put_64bit(msg, DPLL_A_PIN_FREQUENCY_MIN, + sizeof(esync.range[i].min), + &esync.range[i].min, DPLL_A_PIN_PAD)) + goto nest_cancel; + if (nla_put_64bit(msg, DPLL_A_PIN_FREQUENCY_MAX, + sizeof(esync.range[i].max), + &esync.range[i].max, DPLL_A_PIN_PAD)) + goto nest_cancel; + nla_nest_end(msg, nest); + } + return 0; + +nest_cancel: + nla_nest_cancel(msg, nest); + return -EMSGSIZE; +} + static bool dpll_pin_is_freq_supported(struct dpll_pin *pin, u32 freq) { int fs; @@ -481,6 +526,9 @@ dpll_cmd_pin_get_one(struct sk_buff *msg, struct dpll_pin *pin, if (ret) return ret; ret = dpll_msg_add_ffo(msg, pin, ref, extack); + if (ret) + return ret; + ret = dpll_msg_add_pin_esync(msg, pin, ref, extack); if (ret) return ret; if (xa_empty(&pin->parent_refs)) @@ -738,6 +786,83 @@ rollback: return ret; } +static int +dpll_pin_esync_set(struct dpll_pin *pin, struct nlattr *a, + struct netlink_ext_ack *extack) +{ + struct dpll_pin_ref *ref, *failed; + const struct dpll_pin_ops *ops; + struct dpll_pin_esync esync; + u64 freq = nla_get_u64(a); + struct dpll_device *dpll; + bool supported = false; + unsigned long i; + int ret; + + xa_for_each(&pin->dpll_refs, i, ref) { + ops = dpll_pin_ops(ref); + if (!ops->esync_set || !ops->esync_get) { + NL_SET_ERR_MSG(extack, + "embedded sync feature is not supported by this device"); + return -EOPNOTSUPP; + } + } + ref = dpll_xa_ref_dpll_first(&pin->dpll_refs); + ops = dpll_pin_ops(ref); + dpll = ref->dpll; + ret = ops->esync_get(pin, dpll_pin_on_dpll_priv(dpll, pin), dpll, + dpll_priv(dpll), &esync, extack); + if (ret) { + NL_SET_ERR_MSG(extack, "unable to get current embedded sync frequency value"); + return ret; + } + if (freq == esync.freq) + return 0; + for (i = 0; i < esync.range_num; i++) + if (freq <= esync.range[i].max && freq >= esync.range[i].min) + supported = true; + if (!supported) { + NL_SET_ERR_MSG_ATTR(extack, a, + "requested embedded sync frequency value is not supported by this device"); + return -EINVAL; + } + + xa_for_each(&pin->dpll_refs, i, ref) { + void 
*pin_dpll_priv; + + ops = dpll_pin_ops(ref); + dpll = ref->dpll; + pin_dpll_priv = dpll_pin_on_dpll_priv(dpll, pin); + ret = ops->esync_set(pin, pin_dpll_priv, dpll, dpll_priv(dpll), + freq, extack); + if (ret) { + failed = ref; + NL_SET_ERR_MSG_FMT(extack, + "embedded sync frequency set failed for dpll_id: %u", + dpll->id); + goto rollback; + } + } + __dpll_pin_change_ntf(pin); + + return 0; + +rollback: + xa_for_each(&pin->dpll_refs, i, ref) { + void *pin_dpll_priv; + + if (ref == failed) + break; + ops = dpll_pin_ops(ref); + dpll = ref->dpll; + pin_dpll_priv = dpll_pin_on_dpll_priv(dpll, pin); + if (ops->esync_set(pin, pin_dpll_priv, dpll, dpll_priv(dpll), + esync.freq, extack)) + NL_SET_ERR_MSG(extack, "set embedded sync frequency rollback failed"); + } + return ret; +} + static int dpll_pin_on_pin_state_set(struct dpll_pin *pin, u32 parent_idx, enum dpll_pin_state state, @@ -1039,6 +1164,11 @@ dpll_pin_set_from_nlattr(struct dpll_pin *pin, struct genl_info *info) if (ret) return ret; break; + case DPLL_A_PIN_ESYNC_FREQUENCY: + ret = dpll_pin_esync_set(pin, a, info->extack); + if (ret) + return ret; + break; } } diff --git a/drivers/dpll/dpll_nl.c b/drivers/dpll/dpll_nl.c index 1e95f5397cfc..fe9b6893d261 100644 --- a/drivers/dpll/dpll_nl.c +++ b/drivers/dpll/dpll_nl.c @@ -62,7 +62,7 @@ static const struct nla_policy dpll_pin_get_dump_nl_policy[DPLL_A_PIN_ID + 1] = }; /* DPLL_CMD_PIN_SET - do */ -static const struct nla_policy dpll_pin_set_nl_policy[DPLL_A_PIN_PHASE_ADJUST + 1] = { +static const struct nla_policy dpll_pin_set_nl_policy[DPLL_A_PIN_ESYNC_FREQUENCY + 1] = { [DPLL_A_PIN_ID] = { .type = NLA_U32, }, [DPLL_A_PIN_FREQUENCY] = { .type = NLA_U64, }, [DPLL_A_PIN_DIRECTION] = NLA_POLICY_RANGE(NLA_U32, 1, 2), @@ -71,6 +71,7 @@ static const struct nla_policy dpll_pin_set_nl_policy[DPLL_A_PIN_PHASE_ADJUST + [DPLL_A_PIN_PARENT_DEVICE] = NLA_POLICY_NESTED(dpll_pin_parent_device_nl_policy), [DPLL_A_PIN_PARENT_PIN] = NLA_POLICY_NESTED(dpll_pin_parent_pin_nl_policy), [DPLL_A_PIN_PHASE_ADJUST] = { .type = NLA_S32, }, + [DPLL_A_PIN_ESYNC_FREQUENCY] = { .type = NLA_U64, }, }; /* Ops table for dpll */ @@ -138,7 +139,7 @@ static const struct genl_split_ops dpll_nl_ops[] = { .doit = dpll_nl_pin_set_doit, .post_doit = dpll_pin_post_doit, .policy = dpll_pin_set_nl_policy, - .maxattr = DPLL_A_PIN_PHASE_ADJUST, + .maxattr = DPLL_A_PIN_ESYNC_FREQUENCY, .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO, }, }; diff --git a/include/linux/dpll.h b/include/linux/dpll.h index d275736230b3..81f7b623d0ba 100644 --- a/include/linux/dpll.h +++ b/include/linux/dpll.h @@ -15,6 +15,7 @@ struct dpll_device; struct dpll_pin; +struct dpll_pin_esync; struct dpll_device_ops { int (*mode_get)(const struct dpll_device *dpll, void *dpll_priv, @@ -83,6 +84,13 @@ struct dpll_pin_ops { int (*ffo_get)(const struct dpll_pin *pin, void *pin_priv, const struct dpll_device *dpll, void *dpll_priv, s64 *ffo, struct netlink_ext_ack *extack); + int (*esync_set)(const struct dpll_pin *pin, void *pin_priv, + const struct dpll_device *dpll, void *dpll_priv, + u64 freq, struct netlink_ext_ack *extack); + int (*esync_get)(const struct dpll_pin *pin, void *pin_priv, + const struct dpll_device *dpll, void *dpll_priv, + struct dpll_pin_esync *esync, + struct netlink_ext_ack *extack); }; struct dpll_pin_frequency { @@ -111,6 +119,13 @@ struct dpll_pin_phase_adjust_range { s32 max; }; +struct dpll_pin_esync { + u64 freq; + const struct dpll_pin_frequency *range; + u8 range_num; + u8 pulse; +}; + struct dpll_pin_properties { const char *board_label; 
const char *panel_label; diff --git a/include/uapi/linux/dpll.h b/include/uapi/linux/dpll.h index 0c13d7f1a1bc..b0654ade7b7e 100644 --- a/include/uapi/linux/dpll.h +++ b/include/uapi/linux/dpll.h @@ -210,6 +210,9 @@ enum dpll_a_pin { DPLL_A_PIN_PHASE_ADJUST, DPLL_A_PIN_PHASE_OFFSET, DPLL_A_PIN_FRACTIONAL_FREQUENCY_OFFSET, + DPLL_A_PIN_ESYNC_FREQUENCY, + DPLL_A_PIN_ESYNC_FREQUENCY_SUPPORTED, + DPLL_A_PIN_ESYNC_PULSE, __DPLL_A_PIN_MAX, DPLL_A_PIN_MAX = (__DPLL_A_PIN_MAX - 1) -- cgit v1.2.3 From d8ea645d6984c84a87032063a0941f15a323831f Mon Sep 17 00:00:00 2001 From: Selvin Xavier Date: Sun, 18 Aug 2024 21:47:26 -0700 Subject: RDMA/bnxt_re: Handle variable WQE support for user applications User library calculates the number of slots required for user applications and it can pass that information to the driver. Driver can use this value and update the HW directly. This mechanism is currently used only for the newly introduced variable size WQEs. Extend the bnxt_re_qp_req structure to pass the Send Queue slot count. Reorganize the code to get the sq_slots before initializing the Send Queue attributes. Link: https://patch.msgid.link/r/1724042847-1481-5-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Hongguang Gao Signed-off-by: Selvin Xavier Signed-off-by: Jason Gunthorpe --- drivers/infiniband/hw/bnxt_re/ib_verbs.c | 106 ++++++++++++++++++------------- drivers/infiniband/hw/bnxt_re/ib_verbs.h | 16 ++++- include/uapi/rdma/bnxt_re-abi.h | 6 ++ 3 files changed, 82 insertions(+), 46 deletions(-) (limited to 'include/uapi') diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c index 4dd137b7a5ce..2932db129958 100644 --- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c +++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c @@ -1017,20 +1017,15 @@ static int bnxt_re_setup_swqe_size(struct bnxt_re_qp *qp, } static int bnxt_re_init_user_qp(struct bnxt_re_dev *rdev, struct bnxt_re_pd *pd, - struct bnxt_re_qp *qp, struct ib_udata *udata) + struct bnxt_re_qp *qp, struct bnxt_re_ucontext *cntx, + struct bnxt_re_qp_req *ureq) { struct bnxt_qplib_qp *qplib_qp; - struct bnxt_re_ucontext *cntx; - struct bnxt_re_qp_req ureq; int bytes = 0, psn_sz; struct ib_umem *umem; int psn_nume; qplib_qp = &qp->qplib_qp; - cntx = rdma_udata_to_drv_context(udata, struct bnxt_re_ucontext, - ib_uctx); - if (ib_copy_from_udata(&ureq, udata, sizeof(ureq))) - return -EFAULT; bytes = (qplib_qp->sq.max_wqe * qplib_qp->sq.wqe_size); /* Consider mapping PSN search memory only for RC QPs. */ @@ -1038,17 +1033,20 @@ static int bnxt_re_init_user_qp(struct bnxt_re_dev *rdev, struct bnxt_re_pd *pd, psn_sz = bnxt_qplib_is_chip_gen_p5_p7(rdev->chip_ctx) ? sizeof(struct sq_psn_search_ext) : sizeof(struct sq_psn_search); - psn_nume = (qplib_qp->wqe_mode == BNXT_QPLIB_WQE_MODE_STATIC) ? - qplib_qp->sq.max_wqe : - ((qplib_qp->sq.max_wqe * qplib_qp->sq.wqe_size) / - sizeof(struct bnxt_qplib_sge)); + if (cntx && bnxt_re_is_var_size_supported(rdev, cntx)) { + psn_nume = ureq->sq_slots; + } else { + psn_nume = (qplib_qp->wqe_mode == BNXT_QPLIB_WQE_MODE_STATIC) ? 
+ qplib_qp->sq.max_wqe : ((qplib_qp->sq.max_wqe * qplib_qp->sq.wqe_size) / + sizeof(struct bnxt_qplib_sge)); + } if (_is_host_msn_table(rdev->qplib_res.dattr->dev_cap_flags2)) psn_nume = roundup_pow_of_two(psn_nume); bytes += (psn_nume * psn_sz); } bytes = PAGE_ALIGN(bytes); - umem = ib_umem_get(&rdev->ibdev, ureq.qpsva, bytes, + umem = ib_umem_get(&rdev->ibdev, ureq->qpsva, bytes, IB_ACCESS_LOCAL_WRITE); if (IS_ERR(umem)) return PTR_ERR(umem); @@ -1057,12 +1055,12 @@ static int bnxt_re_init_user_qp(struct bnxt_re_dev *rdev, struct bnxt_re_pd *pd, qplib_qp->sq.sg_info.umem = umem; qplib_qp->sq.sg_info.pgsize = PAGE_SIZE; qplib_qp->sq.sg_info.pgshft = PAGE_SHIFT; - qplib_qp->qp_handle = ureq.qp_handle; + qplib_qp->qp_handle = ureq->qp_handle; if (!qp->qplib_qp.srq) { bytes = (qplib_qp->rq.max_wqe * qplib_qp->rq.wqe_size); bytes = PAGE_ALIGN(bytes); - umem = ib_umem_get(&rdev->ibdev, ureq.qprva, bytes, + umem = ib_umem_get(&rdev->ibdev, ureq->qprva, bytes, IB_ACCESS_LOCAL_WRITE); if (IS_ERR(umem)) goto rqfail; @@ -1261,14 +1259,15 @@ static void bnxt_re_adjust_gsi_rq_attr(struct bnxt_re_qp *qp) static int bnxt_re_init_sq_attr(struct bnxt_re_qp *qp, struct ib_qp_init_attr *init_attr, - struct bnxt_re_ucontext *uctx) + struct bnxt_re_ucontext *uctx, + struct bnxt_re_qp_req *ureq) { struct bnxt_qplib_dev_attr *dev_attr; struct bnxt_qplib_qp *qplqp; struct bnxt_re_dev *rdev; struct bnxt_qplib_q *sq; + int diff = 0; int entries; - int diff; int rc; rdev = qp->rdev; @@ -1277,22 +1276,28 @@ static int bnxt_re_init_sq_attr(struct bnxt_re_qp *qp, dev_attr = &rdev->dev_attr; sq->max_sge = init_attr->cap.max_send_sge; - if (sq->max_sge > dev_attr->max_qp_sges) { - sq->max_sge = dev_attr->max_qp_sges; - init_attr->cap.max_send_sge = sq->max_sge; - } + entries = init_attr->cap.max_send_wr; + if (uctx && qplqp->wqe_mode == BNXT_QPLIB_WQE_MODE_VARIABLE) { + sq->max_wqe = ureq->sq_slots; + sq->max_sw_wqe = ureq->sq_slots; + sq->wqe_size = sizeof(struct sq_sge); + } else { + if (sq->max_sge > dev_attr->max_qp_sges) { + sq->max_sge = dev_attr->max_qp_sges; + init_attr->cap.max_send_sge = sq->max_sge; + } - rc = bnxt_re_setup_swqe_size(qp, init_attr); - if (rc) - return rc; + rc = bnxt_re_setup_swqe_size(qp, init_attr); + if (rc) + return rc; - entries = init_attr->cap.max_send_wr; - /* Allocate 128 + 1 more than what's provided */ - diff = (qplqp->wqe_mode == BNXT_QPLIB_WQE_MODE_VARIABLE) ? - 0 : BNXT_QPLIB_RESERVED_QP_WRS; - entries = bnxt_re_init_depth(entries + diff + 1, uctx); - sq->max_wqe = min_t(u32, entries, dev_attr->max_qp_wqes + diff + 1); - sq->max_sw_wqe = bnxt_qplib_get_depth(sq, qplqp->wqe_mode, true); + /* Allocate 128 + 1 more than what's provided */ + diff = (qplqp->wqe_mode == BNXT_QPLIB_WQE_MODE_VARIABLE) ? + 0 : BNXT_QPLIB_RESERVED_QP_WRS; + entries = bnxt_re_init_depth(entries + diff + 1, uctx); + sq->max_wqe = min_t(u32, entries, dev_attr->max_qp_wqes + diff + 1); + sq->max_sw_wqe = bnxt_qplib_get_depth(sq, qplqp->wqe_mode, true); + } sq->q_full_delta = diff + 1; /* * Reserving one slot for Phantom WQE. 
Application can @@ -1355,10 +1360,10 @@ out: static int bnxt_re_init_qp_attr(struct bnxt_re_qp *qp, struct bnxt_re_pd *pd, struct ib_qp_init_attr *init_attr, - struct ib_udata *udata) + struct bnxt_re_ucontext *uctx, + struct bnxt_re_qp_req *ureq) { struct bnxt_qplib_dev_attr *dev_attr; - struct bnxt_re_ucontext *uctx; struct bnxt_qplib_qp *qplqp; struct bnxt_re_dev *rdev; struct bnxt_re_cq *cq; @@ -1368,7 +1373,6 @@ static int bnxt_re_init_qp_attr(struct bnxt_re_qp *qp, struct bnxt_re_pd *pd, qplqp = &qp->qplib_qp; dev_attr = &rdev->dev_attr; - uctx = rdma_udata_to_drv_context(udata, struct bnxt_re_ucontext, ib_uctx); /* Setup misc params */ ether_addr_copy(qplqp->smac, rdev->netdev->dev_addr); qplqp->pd = &pd->qplib_pd; @@ -1381,8 +1385,7 @@ static int bnxt_re_init_qp_attr(struct bnxt_re_qp *qp, struct bnxt_re_pd *pd, goto out; } qplqp->type = (u8)qptype; - qplqp->wqe_mode = rdev->chip_ctx->modes.wqe_mode; - + qplqp->wqe_mode = bnxt_re_is_var_size_supported(rdev, uctx); if (init_attr->qp_type == IB_QPT_RC) { qplqp->max_rd_atomic = dev_attr->max_qp_rd_atom; qplqp->max_dest_rd_atomic = dev_attr->max_qp_init_rd_atom; @@ -1417,14 +1420,14 @@ static int bnxt_re_init_qp_attr(struct bnxt_re_qp *qp, struct bnxt_re_pd *pd, bnxt_re_adjust_gsi_rq_attr(qp); /* Setup SQ */ - rc = bnxt_re_init_sq_attr(qp, init_attr, uctx); + rc = bnxt_re_init_sq_attr(qp, init_attr, uctx, ureq); if (rc) goto out; if (init_attr->qp_type == IB_QPT_GSI) bnxt_re_adjust_gsi_sq_attr(qp, init_attr, uctx); - if (udata) /* This will update DPI and qp_handle */ - rc = bnxt_re_init_user_qp(rdev, pd, qp, udata); + if (uctx) /* This will update DPI and qp_handle */ + rc = bnxt_re_init_user_qp(rdev, pd, qp, uctx, ureq); out: return rc; } @@ -1525,14 +1528,27 @@ static bool bnxt_re_test_qp_limits(struct bnxt_re_dev *rdev, int bnxt_re_create_qp(struct ib_qp *ib_qp, struct ib_qp_init_attr *qp_init_attr, struct ib_udata *udata) { - struct ib_pd *ib_pd = ib_qp->pd; - struct bnxt_re_pd *pd = container_of(ib_pd, struct bnxt_re_pd, ib_pd); - struct bnxt_re_dev *rdev = pd->rdev; - struct bnxt_qplib_dev_attr *dev_attr = &rdev->dev_attr; - struct bnxt_re_qp *qp = container_of(ib_qp, struct bnxt_re_qp, ib_qp); + struct bnxt_qplib_dev_attr *dev_attr; + struct bnxt_re_ucontext *uctx; + struct bnxt_re_qp_req ureq; + struct bnxt_re_dev *rdev; + struct bnxt_re_pd *pd; + struct bnxt_re_qp *qp; + struct ib_pd *ib_pd; u32 active_qps; int rc; + ib_pd = ib_qp->pd; + pd = container_of(ib_pd, struct bnxt_re_pd, ib_pd); + rdev = pd->rdev; + dev_attr = &rdev->dev_attr; + qp = container_of(ib_qp, struct bnxt_re_qp, ib_qp); + + uctx = rdma_udata_to_drv_context(udata, struct bnxt_re_ucontext, ib_uctx); + if (udata) + if (ib_copy_from_udata(&ureq, udata, min(udata->inlen, sizeof(ureq)))) + return -EFAULT; + rc = bnxt_re_test_qp_limits(rdev, qp_init_attr, dev_attr); if (!rc) { rc = -EINVAL; @@ -1540,7 +1556,7 @@ int bnxt_re_create_qp(struct ib_qp *ib_qp, struct ib_qp_init_attr *qp_init_attr, } qp->rdev = rdev; - rc = bnxt_re_init_qp_attr(qp, pd, qp_init_attr, udata); + rc = bnxt_re_init_qp_attr(qp, pd, qp_init_attr, uctx, &ureq); if (rc) goto fail; @@ -4215,7 +4231,7 @@ int bnxt_re_alloc_ucontext(struct ib_ucontext *ctx, struct ib_udata *udata) goto cfail; if (ureq.comp_mask & BNXT_RE_COMP_MASK_REQ_UCNTX_POW2_SUPPORT) { resp.comp_mask |= BNXT_RE_UCNTX_CMASK_POW2_DISABLED; - uctx->cmask |= BNXT_RE_UCNTX_CMASK_POW2_DISABLED; + uctx->cmask |= BNXT_RE_UCNTX_CAP_POW2_DISABLED; } } diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.h 
b/drivers/infiniband/hw/bnxt_re/ib_verbs.h index e98cb1717338..7c8350fb8aad 100644 --- a/drivers/infiniband/hw/bnxt_re/ib_verbs.h +++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.h @@ -171,12 +171,26 @@ static inline u16 bnxt_re_get_rwqe_size(int nsge) return sizeof(struct rq_wqe_hdr) + (nsge * sizeof(struct sq_sge)); } +enum { + BNXT_RE_UCNTX_CAP_POW2_DISABLED = 0x1ULL, + BNXT_RE_UCNTX_CAP_VAR_WQE_ENABLED = 0x2ULL, +}; + static inline u32 bnxt_re_init_depth(u32 ent, struct bnxt_re_ucontext *uctx) { - return uctx ? (uctx->cmask & BNXT_RE_UCNTX_CMASK_POW2_DISABLED) ? + return uctx ? (uctx->cmask & BNXT_RE_UCNTX_CAP_POW2_DISABLED) ? ent : roundup_pow_of_two(ent) : ent; } +static inline bool bnxt_re_is_var_size_supported(struct bnxt_re_dev *rdev, + struct bnxt_re_ucontext *uctx) +{ + if (uctx) + return uctx->cmask & BNXT_RE_UCNTX_CAP_VAR_WQE_ENABLED; + else + return rdev->chip_ctx->modes.wqe_mode; +} + int bnxt_re_query_device(struct ib_device *ibdev, struct ib_device_attr *ib_attr, struct ib_udata *udata); diff --git a/include/uapi/rdma/bnxt_re-abi.h b/include/uapi/rdma/bnxt_re-abi.h index e61104f35d73..71140618700a 100644 --- a/include/uapi/rdma/bnxt_re-abi.h +++ b/include/uapi/rdma/bnxt_re-abi.h @@ -118,10 +118,16 @@ struct bnxt_re_resize_cq_req { __aligned_u64 cq_va; }; +enum bnxt_re_qp_mask { + BNXT_RE_QP_REQ_MASK_VAR_WQE_SQ_SLOTS = 0x1, +}; + struct bnxt_re_qp_req { __aligned_u64 qpsva; __aligned_u64 qprva; __aligned_u64 qp_handle; + __aligned_u64 comp_mask; + __u32 sq_slots; }; struct bnxt_re_qp_resp { -- cgit v1.2.3 From 10a104c0debbb19a1e45193d5670510216e339ff Mon Sep 17 00:00:00 2001 From: Selvin Xavier Date: Sun, 18 Aug 2024 21:47:27 -0700 Subject: RDMA/bnxt_re: Enable variable size WQEs for user space applications Add backward compatibility code to enable variable size WQEs only if the user lib supports it. 
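For context, the negotiation this enables looks roughly like the following
on the library side (a sketch only: the fill_* helpers are invented and the
real provider plumbing is omitted, but the structures and mask bits are the
ones added by these two patches):

	#include <rdma/bnxt_re-abi.h>

	/* alloc_ucontext request: advertise that the library can handle
	 * variable size WQEs; the kernel replies with the WQE mode
	 * (BNXT_RE_UCNTX_CMASK_HAVE_MODE + resp.mode) and records
	 * BNXT_RE_UCNTX_CAP_VAR_WQE_ENABLED internally
	 */
	static void fill_uctx_req(struct bnxt_re_uctx_req *req)
	{
		req->comp_mask = BNXT_RE_COMP_MASK_REQ_UCNTX_POW2_SUPPORT |
				 BNXT_RE_COMP_MASK_REQ_UCNTX_VAR_WQE_SUPPORT;
	}

	/* create_qp request: pass the SQ slot count the library computed */
	static void fill_qp_req(struct bnxt_re_qp_req *req, __u64 sq_va,
				__u64 rq_va, __u64 qp_handle, __u32 sq_slots)
	{
		req->qpsva = sq_va;
		req->qprva = rq_va;
		req->qp_handle = qp_handle;
		req->comp_mask = BNXT_RE_QP_REQ_MASK_VAR_WQE_SQ_SLOTS;
		req->sq_slots = sq_slots;
	}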
Link: https://patch.msgid.link/r/1724042847-1481-6-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Hongguang Gao
Signed-off-by: Selvin Xavier
Signed-off-by: Jason Gunthorpe
---
 drivers/infiniband/hw/bnxt_re/ib_verbs.c | 5 +++++
 include/uapi/rdma/bnxt_re-abi.h | 1 +
 2 files changed, 6 insertions(+)

(limited to 'include/uapi')

diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
index 2932db129958..82444fd748f1 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
@@ -4233,6 +4233,11 @@ int bnxt_re_alloc_ucontext(struct ib_ucontext *ctx, struct ib_udata *udata)
 			resp.comp_mask |= BNXT_RE_UCNTX_CMASK_POW2_DISABLED;
 			uctx->cmask |= BNXT_RE_UCNTX_CAP_POW2_DISABLED;
 		}
+		if (ureq.comp_mask & BNXT_RE_COMP_MASK_REQ_UCNTX_VAR_WQE_SUPPORT) {
+			resp.comp_mask |= BNXT_RE_UCNTX_CMASK_HAVE_MODE;
+			resp.mode = rdev->chip_ctx->modes.wqe_mode;
+			uctx->cmask |= BNXT_RE_UCNTX_CAP_VAR_WQE_ENABLED;
+		}
 	}

 	rc = ib_copy_to_udata(udata, &resp, min(udata->outlen, sizeof(resp)));
diff --git a/include/uapi/rdma/bnxt_re-abi.h b/include/uapi/rdma/bnxt_re-abi.h
index 71140618700a..6821002931c8 100644
--- a/include/uapi/rdma/bnxt_re-abi.h
+++ b/include/uapi/rdma/bnxt_re-abi.h
@@ -66,6 +66,7 @@ enum bnxt_re_wqe_mode {

 enum {
 	BNXT_RE_COMP_MASK_REQ_UCNTX_POW2_SUPPORT = 0x01,
+	BNXT_RE_COMP_MASK_REQ_UCNTX_VAR_WQE_SUPPORT = 0x02,
 };

 struct bnxt_re_uctx_req {
-- cgit v1.2.3


From 947697c6f0f75f9866f2f891a102dece7a09a064 Mon Sep 17 00:00:00 2001
From: Anshuman Khandual
Date: Thu, 22 Aug 2024 10:18:52 +0530
Subject: uapi: Define GENMASK_U128

This adds GENMASK_U128() and __GENMASK_U128() macros using
__BITS_PER_U128 and __int128 data types. These macros will be used in
providing support for generating 128 bit masks.

The macros wouldn't work in all assembler flavors for reasons described
in the comments on top of the declarations. Enforce this by adding a
!__ASSEMBLY__ guard.

Cc: Yury Norov
Cc: Rasmus Villemoes
Cc: Arnd Bergmann
Cc: linux-kernel@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Reviewed-by: Arnd Bergmann
Signed-off-by: Anshuman Khandual
Signed-off-by: Yury Norov
---
 include/linux/bits.h | 15 +++++++++++++++
 include/uapi/linux/bits.h | 3 +++
 include/uapi/linux/const.h | 17 +++++++++++++++++
 3 files changed, 35 insertions(+)

(limited to 'include/uapi')

diff --git a/include/linux/bits.h b/include/linux/bits.h
index 0eb24d21aac2..60044b608817 100644
--- a/include/linux/bits.h
+++ b/include/linux/bits.h
@@ -36,4 +36,19 @@
 #define GENMASK_ULL(h, l) \
 	(GENMASK_INPUT_CHECK(h, l) + __GENMASK_ULL(h, l))

+#if !defined(__ASSEMBLY__)
+/*
+ * Missing asm support
+ *
+ * __GENMASK_U128() depends on _BIT128() which would not work
+ * in the asm code, as it shifts an 'unsigned __int128' data
+ * type instead of direct representation of 128 bit constants
+ * such as long and unsigned long. The fundamental problem is
+ * that a 128 bit constant will get silently truncated by the
+ * gcc compiler.
+ */
+#define GENMASK_U128(h, l) \
+	(GENMASK_INPUT_CHECK(h, l) + __GENMASK_U128(h, l))
+#endif
+
 #endif /* __LINUX_BITS_H */
diff --git a/include/uapi/linux/bits.h b/include/uapi/linux/bits.h
index 3c2a101986a3..5ee30f882736 100644
--- a/include/uapi/linux/bits.h
+++ b/include/uapi/linux/bits.h
@@ -12,4 +12,7 @@
 	(((~_ULL(0)) - (_ULL(1) << (l)) + 1) & \
 	 (~_ULL(0) >> (__BITS_PER_LONG_LONG - 1 - (h))))

+#define __GENMASK_U128(h, l) \
+	((_BIT128((h)) << 1) - (_BIT128(l)))
+
 #endif /* _UAPI_LINUX_BITS_H */
diff --git a/include/uapi/linux/const.h b/include/uapi/linux/const.h
index a429381e7ca5..e16be0d37746 100644
--- a/include/uapi/linux/const.h
+++ b/include/uapi/linux/const.h
@@ -28,6 +28,23 @@
 #define _BITUL(x) (_UL(1) << (x))
 #define _BITULL(x) (_ULL(1) << (x))

+#if !defined(__ASSEMBLY__)
+/*
+ * Missing asm support
+ *
+ * __BIT128() would not work in the asm code, as it shifts an
+ * 'unsigned __int128' data type as direct representation of
+ * 128 bit constants is not supported in the gcc compiler, as
+ * they get silently truncated.
+ *
+ * TODO: Please revisit this implementation when gcc compiler
+ * starts representing 128 bit constants directly like long
+ * and unsigned long etc. Subsequently drop the comment for
+ * GENMASK_U128() which would then start supporting asm code.
+ */
+#define _BIT128(x) ((unsigned __int128)(1) << (x))
+#endif
+
 #define __ALIGN_KERNEL(x, a) __ALIGN_KERNEL_MASK(x, (__typeof__(x))(a) - 1)
 #define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask))
-- cgit v1.2.3
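To make the new macros concrete, here is a small C-only usage sketch
(illustrative, not from the patch; as the guards above explain, asm code
cannot use these):

	#include <linux/bits.h>

	/* bits 127..64 set: expands to (_BIT128(127) << 1) - _BIT128(64),
	 * i.e. 2^128 - 2^64 in 128-bit modular arithmetic, so the upper
	 * half of the value is all ones
	 */
	static const unsigned __int128 upper_half = GENMASK_U128(127, 64);

	/* all 128 bits set: (_BIT128(127) << 1) wraps to 0, minus
	 * _BIT128(0) yields ~0 across the full 128-bit width
	 */
	#define U128_ALL_ONES	GENMASK_U128(127, 0)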
From 57413d8e172c10c90fbd91f98d0f7d8eb27e824c Mon Sep 17 00:00:00 2001
From: Christoph Hellwig
Date: Tue, 27 Aug 2024 08:50:47 +0200
Subject: fs: sort out the fallocate mode vs flag mess

The fallocate system call takes a mode argument, but that argument
contains a wild mix of exclusive modes and optional flags.

Replace FALLOC_FL_SUPPORTED_MASK with FALLOC_FL_MODE_MASK, which
excludes the optional flag bit, so that we can use a switch statement
on the value to easily enumerate the cases while getting the check for
duplicate modes for free.

To make this (and in the future the file system implementations) more
readable also add a symbolic name for the 0 mode used to allocate
blocks.

Signed-off-by: Christoph Hellwig
Link: https://lore.kernel.org/r/20240827065123.1762168-4-hch@lst.de
Reviewed-by: Darrick J. Wong
Reviewed-by: Jan Kara
Signed-off-by: Christian Brauner
---
 fs/open.c | 51 ++++++++++++++++++++++-----------------------
 include/linux/falloc.h | 18 ++++++++++------
 include/uapi/linux/falloc.h | 1 +
 3 files changed, 38 insertions(+), 32 deletions(-)

(limited to 'include/uapi')

diff --git a/fs/open.c b/fs/open.c
index 22adbef7ecc2..daf1b55ca818 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -252,40 +252,39 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	if (offset < 0 || len <= 0)
 		return -EINVAL;

-	/* Return error if mode is not supported */
-	if (mode & ~FALLOC_FL_SUPPORTED_MASK)
+	if (mode & ~(FALLOC_FL_MODE_MASK | FALLOC_FL_KEEP_SIZE))
 		return -EOPNOTSUPP;

-	/* Punch hole and zero range are mutually exclusive */
-	if ((mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE)) ==
-	    (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE))
-		return -EOPNOTSUPP;
-
-	/* Punch hole must have keep size set */
-	if ((mode & FALLOC_FL_PUNCH_HOLE) &&
-	    !(mode & FALLOC_FL_KEEP_SIZE))
+	/*
+	 * Modes are exclusive, even if that is not obvious from the encoding
+	 * as bit masks and the mix with the flag in the same namespace.
+ * + * To make things even more complicated, FALLOC_FL_ALLOCATE_RANGE is + * encoded as no bit set. + */ + switch (mode & FALLOC_FL_MODE_MASK) { + case FALLOC_FL_ALLOCATE_RANGE: + case FALLOC_FL_UNSHARE_RANGE: + case FALLOC_FL_ZERO_RANGE: + break; + case FALLOC_FL_PUNCH_HOLE: + if (!(mode & FALLOC_FL_KEEP_SIZE)) + return -EOPNOTSUPP; + break; + case FALLOC_FL_COLLAPSE_RANGE: + case FALLOC_FL_INSERT_RANGE: + if (mode & FALLOC_FL_KEEP_SIZE) + return -EOPNOTSUPP; + break; + default: return -EOPNOTSUPP; - - /* Collapse range should only be used exclusively. */ - if ((mode & FALLOC_FL_COLLAPSE_RANGE) && - (mode & ~FALLOC_FL_COLLAPSE_RANGE)) - return -EINVAL; - - /* Insert range should only be used exclusively. */ - if ((mode & FALLOC_FL_INSERT_RANGE) && - (mode & ~FALLOC_FL_INSERT_RANGE)) - return -EINVAL; - - /* Unshare range should only be used with allocate mode. */ - if ((mode & FALLOC_FL_UNSHARE_RANGE) && - (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE))) - return -EINVAL; + } if (!(file->f_mode & FMODE_WRITE)) return -EBADF; /* - * We can only allow pure fallocate on append only files + * On append-only files only space preallocation is supported. */ if ((mode & ~FALLOC_FL_KEEP_SIZE) && IS_APPEND(inode)) return -EPERM; diff --git a/include/linux/falloc.h b/include/linux/falloc.h index f3f0b97b1675..3f49f3df6af5 100644 --- a/include/linux/falloc.h +++ b/include/linux/falloc.h @@ -25,12 +25,18 @@ struct space_resv { #define FS_IOC_UNRESVSP64 _IOW('X', 43, struct space_resv) #define FS_IOC_ZERO_RANGE _IOW('X', 57, struct space_resv) -#define FALLOC_FL_SUPPORTED_MASK (FALLOC_FL_KEEP_SIZE | \ - FALLOC_FL_PUNCH_HOLE | \ - FALLOC_FL_COLLAPSE_RANGE | \ - FALLOC_FL_ZERO_RANGE | \ - FALLOC_FL_INSERT_RANGE | \ - FALLOC_FL_UNSHARE_RANGE) +/* + * Mask of all supported fallocate modes. Only one can be set at a time. + * + * In addition to the mode bit, the mode argument can also encode flags. + * FALLOC_FL_KEEP_SIZE is the only supported flag so far. + */ +#define FALLOC_FL_MODE_MASK (FALLOC_FL_ALLOCATE_RANGE | \ + FALLOC_FL_PUNCH_HOLE | \ + FALLOC_FL_COLLAPSE_RANGE | \ + FALLOC_FL_ZERO_RANGE | \ + FALLOC_FL_INSERT_RANGE | \ + FALLOC_FL_UNSHARE_RANGE) /* on ia32 l_start is on a 32-bit boundary */ #if defined(CONFIG_X86_64) diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h index 51398fa57f6c..5810371ed72b 100644 --- a/include/uapi/linux/falloc.h +++ b/include/uapi/linux/falloc.h @@ -2,6 +2,7 @@ #ifndef _UAPI_FALLOC_H_ #define _UAPI_FALLOC_H_ +#define FALLOC_FL_ALLOCATE_RANGE 0x00 /* allocate range */ #define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend size */ #define FALLOC_FL_PUNCH_HOLE 0x02 /* de-allocates range */ #define FALLOC_FL_NO_HIDE_STALE 0x04 /* reserved codepoint */ -- cgit v1.2.3 From b31c9d9dc343146b9f4ce67b4eee748c49296e99 Mon Sep 17 00:00:00 2001 From: Peter Hutterer Date: Tue, 27 Aug 2024 17:19:29 +0900 Subject: HID: hidraw: add HIDIOCREVOKE ioctl There is a need for userspace applications to open HID devices directly. Use-cases include configuration of gaming mice or direct access to joystick devices. The latter is currently handled by the uaccess tag in systemd, other devices include more custom/local configurations or just sudo. A better approach is what we already have for evdev devices: give the application a file descriptor and revoke it when it may no longer access that device. This patch is the hidraw equivalent to the EVIOCREVOKE ioctl, see commit c7dc65737c9a ("Input: evdev - add EVIOCREVOKE ioctl") for full details. 
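To illustrate the intended flow (a sketch, not from the patch; error
handling omitted):

	#include <fcntl.h>
	#include <stddef.h>
	#include <sys/ioctl.h>
	#include <linux/hidraw.h>

	static void manager_example(void)
	{
		int fd = open("/dev/hidraw0", O_RDWR);

		/* hand a dup of fd to the client (e.g. via SCM_RIGHTS);
		 * both ends now share one open file description
		 */

		/* later, to cut the client off: the argument must be NULL,
		 * and afterwards every dup of this open file description
		 * fails with -ENODEV
		 */
		ioctl(fd, HIDIOCREVOKE, NULL);
	}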
An MR for systemd-logind has been filed here: https://github.com/systemd/systemd/pull/33970 Signed-off-by: Peter Hutterer Link: https://patch.msgid.link/20240827-hidraw-revoke-v5-1-d004a7451aea@kernel.org Signed-off-by: Benjamin Tissoires --- drivers/hid/hidraw.c | 39 +++++++++++++++++++++++++++++++++++---- include/linux/hidraw.h | 1 + include/uapi/linux/hidraw.h | 1 + 3 files changed, 37 insertions(+), 4 deletions(-) (limited to 'include/uapi') diff --git a/drivers/hid/hidraw.c b/drivers/hid/hidraw.c index 716294e40e8a..c887f48756f4 100644 --- a/drivers/hid/hidraw.c +++ b/drivers/hid/hidraw.c @@ -38,12 +38,20 @@ static const struct class hidraw_class = { static struct hidraw *hidraw_table[HIDRAW_MAX_DEVICES]; static DECLARE_RWSEM(minors_rwsem); +static inline bool hidraw_is_revoked(struct hidraw_list *list) +{ + return list->revoked; +} + static ssize_t hidraw_read(struct file *file, char __user *buffer, size_t count, loff_t *ppos) { struct hidraw_list *list = file->private_data; int ret = 0, len; DECLARE_WAITQUEUE(wait, current); + if (hidraw_is_revoked(list)) + return -ENODEV; + mutex_lock(&list->read_mutex); while (ret == 0) { @@ -161,9 +169,13 @@ out: static ssize_t hidraw_write(struct file *file, const char __user *buffer, size_t count, loff_t *ppos) { + struct hidraw_list *list = file->private_data; ssize_t ret; down_read(&minors_rwsem); - ret = hidraw_send_report(file, buffer, count, HID_OUTPUT_REPORT); + if (hidraw_is_revoked(list)) + ret = -ENODEV; + else + ret = hidraw_send_report(file, buffer, count, HID_OUTPUT_REPORT); up_read(&minors_rwsem); return ret; } @@ -256,7 +268,7 @@ static __poll_t hidraw_poll(struct file *file, poll_table *wait) poll_wait(file, &list->hidraw->wait, wait); if (list->head != list->tail) mask |= EPOLLIN | EPOLLRDNORM; - if (!list->hidraw->exist) + if (!list->hidraw->exist || hidraw_is_revoked(list)) mask |= EPOLLERR | EPOLLHUP; return mask; } @@ -320,6 +332,9 @@ static int hidraw_fasync(int fd, struct file *file, int on) { struct hidraw_list *list = file->private_data; + if (hidraw_is_revoked(list)) + return -ENODEV; + return fasync_helper(fd, file, on, &list->fasync); } @@ -372,6 +387,13 @@ static int hidraw_release(struct inode * inode, struct file * file) return 0; } +static int hidraw_revoke(struct hidraw_list *list) +{ + list->revoked = true; + + return 0; +} + static long hidraw_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { @@ -379,11 +401,12 @@ static long hidraw_ioctl(struct file *file, unsigned int cmd, unsigned int minor = iminor(inode); long ret = 0; struct hidraw *dev; + struct hidraw_list *list = file->private_data; void __user *user_arg = (void __user*) arg; down_read(&minors_rwsem); dev = hidraw_table[minor]; - if (!dev || !dev->exist) { + if (!dev || !dev->exist || hidraw_is_revoked(list)) { ret = -ENODEV; goto out; } @@ -421,6 +444,14 @@ static long hidraw_ioctl(struct file *file, unsigned int cmd, ret = -EFAULT; break; } + case HIDIOCREVOKE: + { + if (user_arg) + ret = -EINVAL; + else + ret = hidraw_revoke(list); + break; + } default: { struct hid_device *hid = dev->hid; @@ -527,7 +558,7 @@ int hidraw_report_event(struct hid_device *hid, u8 *data, int len) list_for_each_entry(list, &dev->list, node) { int new_head = (list->head + 1) & (HIDRAW_BUFFER_SIZE - 1); - if (new_head == list->tail) + if (hidraw_is_revoked(list) || new_head == list->tail) continue; if (!(list->buffer[list->head].value = kmemdup(data, len, GFP_ATOMIC))) { diff --git a/include/linux/hidraw.h b/include/linux/hidraw.h index 
cd67f4ca5599..18fd30a288de 100644
--- a/include/linux/hidraw.h
+++ b/include/linux/hidraw.h
@@ -32,6 +32,7 @@ struct hidraw_list {
 	struct hidraw *hidraw;
 	struct list_head node;
 	struct mutex read_mutex;
+	bool revoked;
 };

 #ifdef CONFIG_HIDRAW
diff --git a/include/uapi/linux/hidraw.h b/include/uapi/linux/hidraw.h
index 33ebad81720a..d5ee269864e0 100644
--- a/include/uapi/linux/hidraw.h
+++ b/include/uapi/linux/hidraw.h
@@ -46,6 +46,7 @@ struct hidraw_devinfo {
 /* The first byte of SOUTPUT and GOUTPUT is the report number */
 #define HIDIOCSOUTPUT(len) _IOC(_IOC_WRITE|_IOC_READ, 'H', 0x0B, len)
 #define HIDIOCGOUTPUT(len) _IOC(_IOC_WRITE|_IOC_READ, 'H', 0x0C, len)
+#define HIDIOCREVOKE _IOW('H', 0x0D, int) /* Revoke device access */

 #define HIDRAW_FIRST_MINOR 0
 #define HIDRAW_MAX_DEVICES 64
-- cgit v1.2.3


From ae98dbf43d755b4e111fcd086e53939bef3e9a1a Mon Sep 17 00:00:00 2001
From: Jens Axboe
Date: Fri, 9 Aug 2024 11:20:45 -0600
Subject: io_uring/kbuf: add support for incremental buffer consumption

By default, any recv/read operation that uses provided buffers will
consume at least 1 buffer fully (and maybe more, in case of bundles).
This adds support for incremental consumption, meaning that an
application may add large buffers, and each read/recv will just consume
the part of the buffer that it needs.

For example, let's say an application registers 1MB buffers in a
provided buffer ring, for streaming receives. If it gets a short recv,
then the full 1MB buffer will be consumed and passed back to the
application. With incremental consumption, only the part that was
actually used is consumed, and the buffer remains the current one.

This means that both the application and the kernel need to keep track
of what the current receive point is. Each recv will still pass back a
buffer ID and the size consumed, the only difference is that, before,
the next receive would always be the next buffer in the ring. Now the
same buffer ID may return multiple receives, each at an offset into
that buffer from where the previous receive left off. Example:

Application registers a provided buffer ring, and adds two 32K buffers
to the ring.

Buffer1 address: 0x1000000 (buffer ID 0)
Buffer2 address: 0x2000000 (buffer ID 1)

A recv completion is received with the following values:

cqe->res	0x1000	(4k bytes received)
cqe->flags	0x11	(CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 0)

and the application now knows that 4096b of data is available at
0x1000000, the start of that buffer, and that more data from this
buffer will be coming. Now the next receive comes in:

cqe->res	0x2000	(8k bytes received)
cqe->flags	0x11	(CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 0)

which tells the application that 8k is available where the last
completion left off, at 0x1001000. Next completion is:

cqe->res	0x5000	(20k bytes received)
cqe->flags	0x1	(CQE_F_BUFFER set, buffer ID 0)

and the application now knows that 20k of data is available at
0x1003000, which is where the previous receive ended. CQE_F_BUF_MORE
isn't set, as no more data is available in this buffer ID. The next
completion is then:

cqe->res	0x1000	(4k bytes received)
cqe->flags	0x10011	(CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 1)

which tells the application that buffer ID 1 is now the current one,
hence there's 4k of valid data at 0x2000000. 0x2001000 will be the next
receive point for this buffer ID.

When a buffer will be reused by future CQE completions,
IORING_CQE_F_BUF_MORE will be set in cqe->flags. This tells the
application that the kernel isn't done with the buffer yet, and that it
should expect more completions for this buffer ID. It will only be set
by provided buffer rings setup with IOU_PBUF_RING_INC, as that's the
only type of buffer that will see multiple consecutive completions for
the same buffer ID. For any other provided buffer type, any completion
that passes back a buffer to the application is final.
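The bookkeeping this asks of the application can be sketched as follows
(buf_base/buf_off and consume() are the application's own state and
helpers, not API; only the flag and shift names come from the uapi header):

	#include <linux/io_uring.h>

	extern void consume(void *data, int len);	/* app-defined */

	static void handle_buffer_cqe(struct io_uring_cqe *cqe,
				      unsigned char **buf_base, size_t *buf_off)
	{
		unsigned int id;

		if (!(cqe->flags & IORING_CQE_F_BUFFER) || cqe->res <= 0)
			return;
		id = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
		/* data landed at the current offset within this buffer ID */
		consume(buf_base[id] + buf_off[id], cqe->res);
		if (cqe->flags & IORING_CQE_F_BUF_MORE)
			buf_off[id] += cqe->res; /* kernel keeps this buffer */
		else
			buf_off[id] = 0;	 /* done, ring head advanced */
	}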
Once a buffer has been fully consumed, the buffer ring head is
incremented and the next receive will indicate the next buffer ID in
the CQE cflags.

On the send side, the application can manage how much data is sent from
an existing buffer by setting sqe->len to the desired send length.

An application can request incremental consumption by setting
IOU_PBUF_RING_INC in the provided buffer ring registration. Outside of
that, any provided buffer ring setup and buffer additions are done like
before, no changes there. The only change is in how an application may
see multiple completions for the same buffer ID, hence needing to know
where the next receive will happen.

Note that like existing provided buffer rings, this should not be used
with IOSQE_ASYNC, as both really require the ring to remain locked over
the duration of the buffer selection and the operation completion. It
will consume a buffer otherwise regardless of the size of the IO done.

Signed-off-by: Jens Axboe
---
 include/uapi/linux/io_uring.h | 18 ++++++++++++++++++
 io_uring/kbuf.c | 42 ++++++++++++++++++++++++++++++------------
 io_uring/kbuf.h | 42 ++++++++++++++++++++++++++++++++++--------
 3 files changed, 82 insertions(+), 20 deletions(-)

(limited to 'include/uapi')

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 042eab793e26..a275f91d2ac0 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -440,11 +440,21 @@ struct io_uring_cqe {
 * IORING_CQE_F_SOCK_NONEMPTY	If set, more data to read after socket recv
 * IORING_CQE_F_NOTIF	Set for notification CQEs. Can be used to distinct
 *			them from sends.
+ * IORING_CQE_F_BUF_MORE If set, the buffer ID set in the completion will get
+ *			more completions. In other words, the buffer is being
+ *			partially consumed, and will be used by the kernel for
+ *			more completions. This is only set for buffers used via
+ *			the incremental buffer consumption, as provided by
+ *			a ring buffer setup with IOU_PBUF_RING_INC. For any
+ *			other provided buffer type, all completions with a
+ *			buffer passed back is automatically returned to the
+ *			application.
 */
 #define IORING_CQE_F_BUFFER		(1U << 0)
 #define IORING_CQE_F_MORE		(1U << 1)
 #define IORING_CQE_F_SOCK_NONEMPTY	(1U << 2)
 #define IORING_CQE_F_NOTIF		(1U << 3)
+#define IORING_CQE_F_BUF_MORE		(1U << 4)

 #define IORING_CQE_BUFFER_SHIFT		16

@@ -716,9 +726,17 @@ struct io_uring_buf_ring {
 *	mmap(2) with the offset set as:
 *	IORING_OFF_PBUF_RING | (bgid << IORING_OFF_PBUF_SHIFT)
 *	to get a virtual mapping for the ring.
+ * IOU_PBUF_RING_INC:	If set, buffers consumed from this buffer ring can be
+ *			consumed incrementally. Normally one (or more) buffers
+ *			are fully consumed. With incremental consumptions, it's
+ *			feasible to register big ranges of buffers, and each
+ *			use of it will consume only as much as it needs. This
+ *			requires that both the kernel and application keep
+ *			track of where the current read/recv index is at.
*/ enum io_uring_register_pbuf_ring_flags { IOU_PBUF_RING_MMAP = 1, + IOU_PBUF_RING_INC = 2, }; /* argument for IORING_(UN)REGISTER_PBUF_RING */ diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c index 55d01861d8c5..1f503bcc9c9f 100644 --- a/io_uring/kbuf.c +++ b/io_uring/kbuf.c @@ -212,14 +212,25 @@ static int io_ring_buffers_peek(struct io_kiocb *req, struct buf_sel_arg *arg, buf = io_ring_head_to_buf(br, head, bl->mask); if (arg->max_len) { u32 len = READ_ONCE(buf->len); - size_t needed; if (unlikely(!len)) return -ENOBUFS; - needed = (arg->max_len + len - 1) / len; - needed = min_not_zero(needed, (size_t) PEEK_MAX_IMPORT); - if (nr_avail > needed) - nr_avail = needed; + /* + * Limit incremental buffers to 1 segment. No point trying + * to peek ahead and map more than we need, when the buffers + * themselves should be large when setup with + * IOU_PBUF_RING_INC. + */ + if (bl->flags & IOBL_INC) { + nr_avail = 1; + } else { + size_t needed; + + needed = (arg->max_len + len - 1) / len; + needed = min_not_zero(needed, (size_t) PEEK_MAX_IMPORT); + if (nr_avail > needed) + nr_avail = needed; + } } /* @@ -244,16 +255,21 @@ static int io_ring_buffers_peek(struct io_kiocb *req, struct buf_sel_arg *arg, req->buf_index = buf->bid; do { - /* truncate end piece, if needed */ - if (buf->len > arg->max_len) - buf->len = arg->max_len; + u32 len = buf->len; + + /* truncate end piece, if needed, for non partial buffers */ + if (len > arg->max_len) { + len = arg->max_len; + if (!(bl->flags & IOBL_INC)) + buf->len = len; + } iov->iov_base = u64_to_user_ptr(buf->addr); - iov->iov_len = buf->len; + iov->iov_len = len; iov++; - arg->out_len += buf->len; - arg->max_len -= buf->len; + arg->out_len += len; + arg->max_len -= len; if (!arg->max_len) break; @@ -675,7 +691,7 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg) if (reg.resv[0] || reg.resv[1] || reg.resv[2]) return -EINVAL; - if (reg.flags & ~IOU_PBUF_RING_MMAP) + if (reg.flags & ~(IOU_PBUF_RING_MMAP | IOU_PBUF_RING_INC)) return -EINVAL; if (!(reg.flags & IOU_PBUF_RING_MMAP)) { if (!reg.ring_addr) @@ -713,6 +729,8 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg) if (!ret) { bl->nr_entries = reg.ring_entries; bl->mask = reg.ring_entries - 1; + if (reg.flags & IOU_PBUF_RING_INC) + bl->flags |= IOBL_INC; io_buffer_add_list(ctx, bl, reg.bgid); return 0; diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h index b41e2a0a0505..36aadfe5ac00 100644 --- a/io_uring/kbuf.h +++ b/io_uring/kbuf.h @@ -9,6 +9,9 @@ enum { IOBL_BUF_RING = 1, /* ring mapped provided buffers, but mmap'ed by application */ IOBL_MMAP = 2, + /* buffers are consumed incrementally rather than always fully */ + IOBL_INC = 4, + }; struct io_buffer_list { @@ -124,24 +127,45 @@ static inline bool io_kbuf_recycle(struct io_kiocb *req, unsigned issue_flags) /* Mapped buffer ring, return io_uring_buf from head */ #define io_ring_head_to_buf(br, head, mask) &(br)->bufs[(head) & (mask)] -static inline void io_kbuf_commit(struct io_kiocb *req, +static inline bool io_kbuf_commit(struct io_kiocb *req, struct io_buffer_list *bl, int len, int nr) { if (unlikely(!(req->flags & REQ_F_BUFFERS_COMMIT))) - return; - bl->head += nr; + return true; + req->flags &= ~REQ_F_BUFFERS_COMMIT; + + if (unlikely(len < 0)) + return true; + + if (bl->flags & IOBL_INC) { + struct io_uring_buf *buf; + + buf = io_ring_head_to_buf(bl->buf_ring, bl->head, bl->mask); + if (WARN_ON_ONCE(len > buf->len)) + len = buf->len; + buf->len -= len; + if (buf->len) { + buf->addr += len; + return false; + } 
+ } + + bl->head += nr; + return true; } -static inline void __io_put_kbuf_ring(struct io_kiocb *req, int len, int nr) +static inline bool __io_put_kbuf_ring(struct io_kiocb *req, int len, int nr) { struct io_buffer_list *bl = req->buf_list; + bool ret = true; if (bl) { - io_kbuf_commit(req, bl, len, nr); + ret = io_kbuf_commit(req, bl, len, nr); req->buf_index = bl->bgid; } req->flags &= ~REQ_F_BUFFER_RING; + return ret; } static inline void __io_put_kbuf_list(struct io_kiocb *req, int len, @@ -176,10 +200,12 @@ static inline unsigned int __io_put_kbufs(struct io_kiocb *req, int len, return 0; ret = IORING_CQE_F_BUFFER | (req->buf_index << IORING_CQE_BUFFER_SHIFT); - if (req->flags & REQ_F_BUFFER_RING) - __io_put_kbuf_ring(req, len, nbufs); - else + if (req->flags & REQ_F_BUFFER_RING) { + if (!__io_put_kbuf_ring(req, len, nbufs)) + ret |= IORING_CQE_F_BUF_MORE; + } else { __io_put_kbuf(req, len, issue_flags); + } return ret; } -- cgit v1.2.3 From 433f9d76a01056dfeaefc15167b11e514e56f956 Mon Sep 17 00:00:00 2001 From: Ian Kent Date: Wed, 14 Aug 2024 17:02:31 +0800 Subject: autofs: add per dentry expire timeout Add the ability to set a per-dentry mount expire timeout in autofs. There are two fairly well-known automounter map formats, the autofs format and the amd format (more or less System V and Berkeley). Some time ago Linux autofs added an amd map format parser that implemented a fair amount of the amd functionality. This was done within the autofs infrastructure and some functionality wasn't implemented because it either didn't make sense or required extra kernel changes. The idea was to restrict changes to be within the existing autofs functionality as much as possible and leave changes with a wider scope to be considered later. One of these changes is implementing the amd options: 1) "unmount", expire this mount according to a timeout (same as the current autofs default). 2) "nounmount", don't expire this mount (same as setting the autofs timeout to 0 except only for this specific mount). 3) "utimeout=", expire this mount using the specified timeout (again same as setting the autofs timeout but only for this mount). To implement these options per-dentry expire timeouts need to be implemented for autofs indirect mounts. This is because all map keys (mounts) for autofs indirect mounts use an expire timeout stored in the autofs mount super block info structure, so all indirect mounts use the same expire timeout. There is now a request to add the "nounmount" option, so the per-dentry expire handling needs to be added to the kernel implementation. The implementation uses the trailing path component to identify the mount (it is also used as the autofs map key); it is passed in the autofs_dev_ioctl structure path field. The expire timeout is passed in the autofs_dev_ioctl timeout field (well, of the timeout union). If the passed-in timeout is equal to -1 the per-dentry timeout and flag are cleared, providing for the "unmount" option. If the timeout is greater than or equal to 0 the timeout is set to the value and the flag is also set. If the dentry timeout is 0 the dentry will not expire by timeout, which enables the implementation of the "nounmount" option for the specific mount. When the dentry timeout is greater than zero it allows for the implementation of the "utimeout=" option.
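For orientation, a rough userspace sketch of driving this interface follows. The control device and struct layout are the usual auto_dev-ioctl.h ones; the mount point path and map key are illustrative, and error handling is minimal.

    /*
     * Hedged sketch: set a per-dentry timeout of 0 ("nounmount") for one
     * indirect mount key. The /mnt/auto path and "data" key are made up.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/auto_dev-ioctl.h>

    int main(void)
    {
            const char *key = "data";   /* trailing path component, no '/' */
            size_t sz = AUTOFS_DEV_IOCTL_SIZE + strlen(key) + 1;
            struct autofs_dev_ioctl *param = calloc(1, sz);
            int ctlfd = open("/dev/autofs", O_RDWR);

            init_autofs_dev_ioctl(param);
            param->size = sz;           /* > AUTOFS_DEV_IOCTL_SIZE selects per-dentry mode */
            /* fd of the autofs mount root; AUTOFS_DEV_IOCTL_OPENMOUNT is the canonical way */
            param->ioctlfd = open("/mnt/auto", O_RDONLY);
            param->timeout.timeout = 0; /* 0 disables expiry; -1 reverts to the sb timeout */
            strcpy(param->path, key);

            if (ioctl(ctlfd, AUTOFS_DEV_IOCTL_TIMEOUT, param) == -1)
                    perror("AUTOFS_DEV_IOCTL_TIMEOUT");
            return 0;
    }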
Signed-off-by: Ian Kent Link: https://lore.kernel.org/r/20240814090231.963520-1-raven@themaw.net Signed-off-by: Christian Brauner --- fs/autofs/autofs_i.h | 4 ++ fs/autofs/dev-ioctl.c | 97 +++++++++++++++++++++++++++++++++++++++++--- fs/autofs/expire.c | 7 +++- fs/autofs/inode.c | 2 + include/uapi/linux/auto_fs.h | 2 +- 5 files changed, 104 insertions(+), 8 deletions(-) (limited to 'include/uapi') diff --git a/fs/autofs/autofs_i.h b/fs/autofs/autofs_i.h index 8c1d587b3eef..77c7991d89aa 100644 --- a/fs/autofs/autofs_i.h +++ b/fs/autofs/autofs_i.h @@ -62,6 +62,7 @@ struct autofs_info { struct list_head expiring; struct autofs_sb_info *sbi; + unsigned long exp_timeout; unsigned long last_used; int count; @@ -81,6 +82,9 @@ struct autofs_info { */ #define AUTOFS_INF_PENDING (1<<2) /* dentry pending mount */ +#define AUTOFS_INF_EXPIRE_SET (1<<3) /* per-dentry expire timeout set for + this mount point. + */ struct autofs_wait_queue { wait_queue_head_t queue; struct autofs_wait_queue *next; diff --git a/fs/autofs/dev-ioctl.c b/fs/autofs/dev-ioctl.c index 5bf781ea6d67..f011e026358e 100644 --- a/fs/autofs/dev-ioctl.c +++ b/fs/autofs/dev-ioctl.c @@ -128,7 +128,13 @@ static int validate_dev_ioctl(int cmd, struct autofs_dev_ioctl *param) goto out; } + /* Setting the per-dentry expire timeout requires a trailing + * path component, i.e. no '/', so invert the logic of the + * check_name() return for AUTOFS_DEV_IOCTL_TIMEOUT_CMD. + */ err = check_name(param->path); + if (cmd == AUTOFS_DEV_IOCTL_TIMEOUT_CMD) + err = err ? 0 : -EINVAL; if (err) { pr_warn("invalid path supplied for cmd(0x%08x)\n", cmd); @@ -396,16 +402,97 @@ static int autofs_dev_ioctl_catatonic(struct file *fp, return 0; } -/* Set the autofs mount timeout */ +/* + * Set the autofs mount expire timeout. + * + * There are two places an expire timeout can be set: in the autofs + * super block info (this is all that's needed for direct and offset + * mounts because there's a distinct mount corresponding to each of + * these) and per-dentry within the dentry info. If a per-dentry + * timeout is set it will override the expire timeout set in the parent + * autofs super block info. + * + * If setting the autofs super block expire timeout the autofs_dev_ioctl + * size field will be equal to the autofs_dev_ioctl structure size. If + * setting the per-dentry expire timeout the mount point name is passed + * in the autofs_dev_ioctl path field and the size field updated to + * reflect this. + * + * Setting the autofs mount expire timeout sets the timeout in the super + * block info struct. Setting the per-dentry timeout does a little more. + * If the timeout is equal to -1 the per-dentry timeout (and flag) is + * cleared, which reverts to using the super block timeout; otherwise, if + * the timeout is 0 it is set to this value and the flag is left + * set, which disables expiration for the mount point; lastly, the flag + * and the timeout are set, enabling the dentry to use this timeout. + */ static int autofs_dev_ioctl_timeout(struct file *fp, struct autofs_sb_info *sbi, struct autofs_dev_ioctl *param) { - unsigned long timeout; + unsigned long timeout = param->timeout.timeout; + + /* If setting the expire timeout for an individual indirect + * mount point dentry the mount trailing component path is + * placed in param->path and param->size adjusted to account + * for it; otherwise param->size is set to the structure + * size.
+ */ + if (param->size == AUTOFS_DEV_IOCTL_SIZE) { + param->timeout.timeout = sbi->exp_timeout / HZ; + sbi->exp_timeout = timeout * HZ; + } else { + struct dentry *base = fp->f_path.dentry; + struct inode *inode = base->d_inode; + int path_len = param->size - AUTOFS_DEV_IOCTL_SIZE - 1; + struct dentry *dentry; + struct autofs_info *ino; + + if (!autofs_type_indirect(sbi->type)) + return -EINVAL; + + /* An expire timeout greater than the superblock timeout + * could be a problem at shutdown but the super block + * timeout itself can change so all we can really do is + * warn the user. + */ + if (timeout >= sbi->exp_timeout) + pr_warn("per-mount expire timeout is greater than " + "the parent autofs mount timeout which could " + "prevent shutdown\n"); + + inode_lock_shared(inode); + dentry = try_lookup_one_len(param->path, base, path_len); + inode_unlock_shared(inode); + if (IS_ERR_OR_NULL(dentry)) + return dentry ? PTR_ERR(dentry) : -ENOENT; + ino = autofs_dentry_ino(dentry); + if (!ino) { + dput(dentry); + return -ENOENT; + } + + if (ino->exp_timeout && ino->flags & AUTOFS_INF_EXPIRE_SET) + param->timeout.timeout = ino->exp_timeout / HZ; + else + param->timeout.timeout = sbi->exp_timeout / HZ; + + if (timeout == -1) { + /* Revert to using the super block timeout */ + ino->flags &= ~AUTOFS_INF_EXPIRE_SET; + ino->exp_timeout = 0; + } else { + /* Set the dentry expire flag and timeout. + * + * If timeout is 0 it will prevent the expire + * of this particular automount. + */ + ino->flags |= AUTOFS_INF_EXPIRE_SET; + ino->exp_timeout = timeout * HZ; + } + dput(dentry); + } - timeout = param->timeout.timeout; - param->timeout.timeout = sbi->exp_timeout / HZ; - sbi->exp_timeout = timeout * HZ; return 0; } diff --git a/fs/autofs/expire.c b/fs/autofs/expire.c index 39d8c84c16f4..5c2d459e1e48 100644 --- a/fs/autofs/expire.c +++ b/fs/autofs/expire.c @@ -429,8 +429,6 @@ static struct dentry *autofs_expire_indirect(struct super_block *sb, if (!root) return NULL; - timeout = sbi->exp_timeout; - dentry = NULL; while ((dentry = get_next_positive_subdir(dentry, root))) { spin_lock(&sbi->fs_lock); @@ -441,6 +439,11 @@ static struct dentry *autofs_expire_indirect(struct super_block *sb, } spin_unlock(&sbi->fs_lock); + if (ino->flags & AUTOFS_INF_EXPIRE_SET) + timeout = ino->exp_timeout; + else + timeout = sbi->exp_timeout; + expired = should_expire(dentry, mnt, timeout, how); if (!expired) continue; diff --git a/fs/autofs/inode.c b/fs/autofs/inode.c index 64faa6c51f60..ee2edccaef70 100644 --- a/fs/autofs/inode.c +++ b/fs/autofs/inode.c @@ -19,6 +19,7 @@ struct autofs_info *autofs_new_ino(struct autofs_sb_info *sbi) INIT_LIST_HEAD(&ino->expiring); ino->last_used = jiffies; ino->sbi = sbi; + ino->exp_timeout = -1; ino->count = 1; } return ino; @@ -28,6 +29,7 @@ void autofs_clean_ino(struct autofs_info *ino) { ino->uid = GLOBAL_ROOT_UID; ino->gid = GLOBAL_ROOT_GID; + ino->exp_timeout = -1; ino->last_used = jiffies; } diff --git a/include/uapi/linux/auto_fs.h b/include/uapi/linux/auto_fs.h index 1f7925afad2d..8081df849743 100644 --- a/include/uapi/linux/auto_fs.h +++ b/include/uapi/linux/auto_fs.h @@ -23,7 +23,7 @@ #define AUTOFS_MIN_PROTO_VERSION 3 #define AUTOFS_MAX_PROTO_VERSION 5 -#define AUTOFS_PROTO_SUBVERSION 5 +#define AUTOFS_PROTO_SUBVERSION 6 /* * The wait_queue_token (autofs_wqt_t) is part of a structure which is passed -- cgit v1.2.3 From d7eafed3223af19add14b67a390ec2b983d890e0 Mon Sep 17 00:00:00 2001 From: Connor Abbott Date: Wed, 7 Aug 2024 14:04:58 +0100 Subject: drm/msm: Expose expanded UBWC 
config uapi This adds extra parameters that affect UBWC tiling that will be used by the Mesa implementation of VK_EXT_host_image_copy. Signed-off-by: Connor Abbott Patchwork: https://patchwork.freedesktop.org/patch/607401/ Signed-off-by: Rob Clark --- drivers/gpu/drm/msm/adreno/adreno_gpu.c | 6 ++++++ include/uapi/drm/msm_drm.h | 2 ++ 2 files changed, 8 insertions(+) (limited to 'include/uapi') diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c index 120b23542a95..f742ebefb769 100644 --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c @@ -379,6 +379,12 @@ int adreno_get_param(struct msm_gpu *gpu, struct msm_file_private *ctx, case MSM_PARAM_RAYTRACING: *value = adreno_gpu->has_ray_tracing; return 0; + case MSM_PARAM_UBWC_SWIZZLE: + *value = adreno_gpu->ubwc_config.ubwc_swizzle; + return 0; + case MSM_PARAM_MACROTILE_MODE: + *value = adreno_gpu->ubwc_config.macrotile_mode; + return 0; default: DBG("%s: invalid param: %u", gpu->name, param); return -EINVAL; diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h index 3fca72f73861..2377147b6af0 100644 --- a/include/uapi/drm/msm_drm.h +++ b/include/uapi/drm/msm_drm.h @@ -88,6 +88,8 @@ struct drm_msm_timespec { #define MSM_PARAM_VA_SIZE 0x0f /* RO: size of valid GPU iova range (bytes) */ #define MSM_PARAM_HIGHEST_BANK_BIT 0x10 /* RO */ #define MSM_PARAM_RAYTRACING 0x11 /* RO */ +#define MSM_PARAM_UBWC_SWIZZLE 0x12 /* RO */ +#define MSM_PARAM_MACROTILE_MODE 0x13 /* RO */ /* For backwards compat. The original support for preemption was based on * a single ring per priority level so # of priority levels equals the # -- cgit v1.2.3 From 09022bc196d23484a7a5d48cf373f8583e3fcf23 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 7 Aug 2024 20:35:26 +0100 Subject: mm: remove PG_error The PG_error bit is now unused; delete it and free up a bit in page->flags. Link: https://lkml.kernel.org/r/20240807193528.1865100-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- fs/proc/page.c | 1 - include/linux/page-flags.h | 6 +----- include/trace/events/mmflags.h | 1 - include/uapi/linux/kernel-page-flags.h | 2 +- 4 files changed, 2 insertions(+), 8 deletions(-) (limited to 'include/uapi') diff --git a/fs/proc/page.c b/fs/proc/page.c index b7a5c84b5819..73a0f872d97f 100644 --- a/fs/proc/page.c +++ b/fs/proc/page.c @@ -182,7 +182,6 @@ u64 stable_page_flags(const struct page *page) #endif u |= kpf_copy_bit(k, KPF_LOCKED, PG_locked); - u |= kpf_copy_bit(k, KPF_ERROR, PG_error); u |= kpf_copy_bit(k, KPF_DIRTY, PG_dirty); u |= kpf_copy_bit(k, KPF_UPTODATE, PG_uptodate); u |= kpf_copy_bit(k, KPF_WRITEBACK, PG_writeback); diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 5769fe6e4950..a0a29bd092f8 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -66,8 +66,6 @@ * PG_referenced, PG_reclaim are used for page reclaim for anonymous and * file-backed pagecache (see mm/vmscan.c). * - * PG_error is set to indicate that an I/O error occurred on this page. - * * PG_arch_1 is an architecture specific page state bit. The generic code * guarantees that this bit is cleared for a page when it first is entered into * the page cache. @@ -103,7 +101,6 @@ enum pageflags { PG_waiters, /* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */ PG_active, PG_workingset, - PG_error, PG_owner_priv_1, /* Owner use. 
If pagecache, fs may use*/ PG_arch_1, PG_reserved, @@ -183,7 +180,7 @@ enum pageflags { */ /* At least one page in this folio has the hwpoison flag set */ - PG_has_hwpoisoned = PG_error, + PG_has_hwpoisoned = PG_active, PG_large_rmappable = PG_workingset, /* anon or file-backed */ }; @@ -506,7 +503,6 @@ static inline int TestClearPage##uname(struct page *page) { return 0; } __PAGEFLAG(Locked, locked, PF_NO_TAIL) FOLIO_FLAG(waiters, FOLIO_HEAD_PAGE) -PAGEFLAG(Error, error, PF_NO_TAIL) TESTCLEARFLAG(Error, error, PF_NO_TAIL) FOLIO_FLAG(referenced, FOLIO_HEAD_PAGE) FOLIO_TEST_CLEAR_FLAG(referenced, FOLIO_HEAD_PAGE) __FOLIO_SET_FLAG(referenced, FOLIO_HEAD_PAGE) diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index b63d211bd141..b5c4da370a50 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -100,7 +100,6 @@ #define __def_pageflag_names \ DEF_PAGEFLAG_NAME(locked), \ DEF_PAGEFLAG_NAME(waiters), \ - DEF_PAGEFLAG_NAME(error), \ DEF_PAGEFLAG_NAME(referenced), \ DEF_PAGEFLAG_NAME(uptodate), \ DEF_PAGEFLAG_NAME(dirty), \ diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h index 6f2f2720f3ac..ff8032227876 100644 --- a/include/uapi/linux/kernel-page-flags.h +++ b/include/uapi/linux/kernel-page-flags.h @@ -7,7 +7,7 @@ */ #define KPF_LOCKED 0 -#define KPF_ERROR 1 +#define KPF_ERROR 1 /* Now unused */ #define KPF_REFERENCED 2 #define KPF_UPTODATE 3 #define KPF_DIRTY 4 -- cgit v1.2.3 From 181028a0d84cdcc7ac86d05cc49eaa416ce85c8b Mon Sep 17 00:00:00 2001 From: Chandramohan Akula Date: Thu, 29 Aug 2024 08:34:05 -0700 Subject: RDMA/bnxt_re: Share a page to expose per SRQ info with userspace Gen P7 adapters need to share toggle bit information, received in the kernel driver, with user space. User space needs this info to arm the SRQ. A user space application can get this page using the UAPI routines. The library will mmap this page and get the toggle bits to be used in the next ARM doorbell. A hash list is used to map the SRQ structure from the SRQ ID. The SRQ structure is retrieved from the hash list while the library calls the UAPI routine to get the toggle page mapping. Currently the full page is mapped per SRQ.
This can be optimized to enable multiple SRQs from the same application to share the same page at different offsets in the page. Signed-off-by: Chandramohan Akula Signed-off-by: Selvin Xavier Link: https://patch.msgid.link/1724945645-14989-4-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Leon Romanovsky --- drivers/infiniband/hw/bnxt_re/bnxt_re.h | 2 ++ drivers/infiniband/hw/bnxt_re/ib_verbs.c | 34 +++++++++++++++++++++++++++++++- drivers/infiniband/hw/bnxt_re/ib_verbs.h | 1 + drivers/infiniband/hw/bnxt_re/main.c | 6 +++++- include/uapi/rdma/bnxt_re-abi.h | 6 ++++++ 5 files changed, 47 insertions(+), 2 deletions(-) (limited to 'include/uapi') diff --git a/drivers/infiniband/hw/bnxt_re/bnxt_re.h b/drivers/infiniband/hw/bnxt_re/bnxt_re.h index 0912d2fa9634..2be9a62d230f 100644 --- a/drivers/infiniband/hw/bnxt_re/bnxt_re.h +++ b/drivers/infiniband/hw/bnxt_re/bnxt_re.h @@ -141,6 +141,7 @@ struct bnxt_re_pacing { #define BNXT_RE_GRC_FIFO_REG_BASE 0x2000 #define MAX_CQ_HASH_BITS (16) +#define MAX_SRQ_HASH_BITS (16) struct bnxt_re_dev { struct ib_device ibdev; struct list_head list; @@ -196,6 +197,7 @@ struct bnxt_re_dev { struct work_struct dbq_fifo_check_work; struct delayed_work dbq_pacing_work; DECLARE_HASHTABLE(cq_hash, MAX_CQ_HASH_BITS); + DECLARE_HASHTABLE(srq_hash, MAX_SRQ_HASH_BITS); }; #define to_bnxt_re_dev(ptr, member) \ diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c index eafbee4af067..f9f944a6094a 100644 --- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c +++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c @@ -1707,6 +1707,10 @@ int bnxt_re_destroy_srq(struct ib_srq *ib_srq, struct ib_udata *udata) if (qplib_srq->cq) nq = qplib_srq->cq->nq; + if (rdev->chip_ctx->modes.toggle_bits & BNXT_QPLIB_SRQ_TOGGLE_BIT) { + free_page((unsigned long)srq->uctx_srq_page); + hash_del(&srq->hash_entry); + } bnxt_qplib_destroy_srq(&rdev->qplib_res, qplib_srq); ib_umem_release(srq->umem); atomic_dec(&rdev->stats.res.srq_count); @@ -1811,9 +1815,18 @@ int bnxt_re_create_srq(struct ib_srq *ib_srq, } if (udata) { - struct bnxt_re_srq_resp resp; + struct bnxt_re_srq_resp resp = {}; resp.srqid = srq->qplib_srq.id; + if (rdev->chip_ctx->modes.toggle_bits & BNXT_QPLIB_SRQ_TOGGLE_BIT) { + hash_add(rdev->srq_hash, &srq->hash_entry, srq->qplib_srq.id); + srq->uctx_srq_page = (void *)get_zeroed_page(GFP_KERNEL); + if (!srq->uctx_srq_page) { + rc = -ENOMEM; + goto fail; + } + resp.comp_mask |= BNXT_RE_SRQ_TOGGLE_PAGE_SUPPORT; + } rc = ib_copy_to_udata(udata, &resp, sizeof(resp)); if (rc) { ibdev_err(&rdev->ibdev, "SRQ copy to udata failed!"); @@ -4291,6 +4304,19 @@ static struct bnxt_re_cq *bnxt_re_search_for_cq(struct bnxt_re_dev *rdev, u32 cq return cq; } +static struct bnxt_re_srq *bnxt_re_search_for_srq(struct bnxt_re_dev *rdev, u32 srq_id) +{ + struct bnxt_re_srq *srq = NULL, *tmp_srq; + + hash_for_each_possible(rdev->srq_hash, tmp_srq, hash_entry, srq_id) { + if (tmp_srq->qplib_srq.id == srq_id) { + srq = tmp_srq; + break; + } + } + return srq; +} + /* Helper function to mmap the virtual memory from user app */ int bnxt_re_mmap(struct ib_ucontext *ib_uctx, struct vm_area_struct *vma) { @@ -4519,6 +4545,7 @@ static int UVERBS_HANDLER(BNXT_RE_METHOD_GET_TOGGLE_MEM)(struct uverbs_attr_bund struct bnxt_re_ucontext *uctx; struct ib_ucontext *ib_uctx; struct bnxt_re_dev *rdev; + struct bnxt_re_srq *srq; u32 length = PAGE_SIZE; struct bnxt_re_cq *cq; u64 mem_offset; @@ -4550,6 +4577,11 @@ static int UVERBS_HANDLER(BNXT_RE_METHOD_GET_TOGGLE_MEM)(struct uverbs_attr_bund
addr = (u64)cq->uctx_cq_page; break; case BNXT_RE_SRQ_TOGGLE_MEM: + srq = bnxt_re_search_for_srq(rdev, res_id); + if (!srq) + return -EINVAL; + + addr = (u64)srq->uctx_srq_page; break; default: diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.h b/drivers/infiniband/hw/bnxt_re/ib_verbs.h index 060bf62eceba..b789e47ec97a 100644 --- a/drivers/infiniband/hw/bnxt_re/ib_verbs.h +++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.h @@ -78,6 +78,7 @@ struct bnxt_re_srq { struct ib_umem *umem; spinlock_t lock; /* protect srq */ void *uctx_srq_page; + struct hlist_node hash_entry; }; struct bnxt_re_qp { diff --git a/drivers/infiniband/hw/bnxt_re/main.c b/drivers/infiniband/hw/bnxt_re/main.c index 31ba89cffe9d..16a84ca1ce48 100644 --- a/drivers/infiniband/hw/bnxt_re/main.c +++ b/drivers/infiniband/hw/bnxt_re/main.c @@ -139,8 +139,10 @@ static void bnxt_re_set_drv_mode(struct bnxt_re_dev *rdev) if (bnxt_re_hwrm_qcaps(rdev)) dev_err(rdev_to_dev(rdev), "Failed to query hwrm qcaps\n"); - if (bnxt_qplib_is_chip_gen_p7(rdev->chip_ctx)) + if (bnxt_qplib_is_chip_gen_p7(rdev->chip_ctx)) { cctx->modes.toggle_bits |= BNXT_QPLIB_CQ_TOGGLE_BIT; + cctx->modes.toggle_bits |= BNXT_QPLIB_SRQ_TOGGLE_BIT; + } } static void bnxt_re_destroy_chip_ctx(struct bnxt_re_dev *rdev) @@ -1771,6 +1773,8 @@ static int bnxt_re_dev_init(struct bnxt_re_dev *rdev) bnxt_re_vf_res_config(rdev); } hash_init(rdev->cq_hash); + if (rdev->chip_ctx->modes.toggle_bits & BNXT_QPLIB_SRQ_TOGGLE_BIT) + hash_init(rdev->srq_hash); return 0; free_sctx: diff --git a/include/uapi/rdma/bnxt_re-abi.h b/include/uapi/rdma/bnxt_re-abi.h index 6821002931c8..faa9d62b3b30 100644 --- a/include/uapi/rdma/bnxt_re-abi.h +++ b/include/uapi/rdma/bnxt_re-abi.h @@ -141,8 +141,14 @@ struct bnxt_re_srq_req { __aligned_u64 srq_handle; }; +enum bnxt_re_srq_mask { + BNXT_RE_SRQ_TOGGLE_PAGE_SUPPORT = 0x1, +}; + struct bnxt_re_srq_resp { __u32 srqid; + __u32 rsvd; /* padding */ + __aligned_u64 comp_mask; }; enum bnxt_re_shpg_offt { -- cgit v1.2.3 From 8bfb74ae12fa4cd3c9b49bb5913610b5709bffd7 Mon Sep 17 00:00:00 2001 From: Pablo Neira Ayuso Date: Tue, 3 Sep 2024 01:09:27 +0200 Subject: netfilter: nf_tables: zero timeout means element never times out This patch uses zero as the timeout marker for those elements that never expire when the element is created. If userspace provides no timeout for an element, then the default set timeout applies. However, if no default set timeout is specified and the timeout flag is set on, then the timeout extension is allocated and the timeout is set to zero to allow for future updates. Use of zero as a never-timeout marker has been suggested by Phil Sutter. Note that, in older kernels, it is already possible to define elements that never expire by declaring a set with the set timeout flag set on and no global set timeout; in this case, new elements with no explicit timeout do not allocate the timeout extension, hence, they never expire. This approach makes it complicated to accommodate element timeout updates, because element extensions do not support reallocations. Therefore, allocate the timeout extension and use the new marker for this case, but do not expose it to userspace to retain backward compatibility in the set listing.
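The resulting expiry rule is easy to state; a minimal standalone C restatement of the predicate the patch installs in __nft_set_elem_expired() (a sketch, not kernel code):

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Sketch of the new rule: an element with no timeout extension, or
     * one carrying the zero "never times out" marker, never expires;
     * otherwise it expires once the time stamp reaches the stored
     * expiration.
     */
    static bool elem_expired(bool has_timeout_ext, uint64_t timeout,
                             uint64_t expiration, uint64_t tstamp)
    {
            if (!has_timeout_ext || timeout == 0)
                    return false;
            return tstamp >= expiration;
    }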
Signed-off-by: Pablo Neira Ayuso --- include/net/netfilter/nf_tables.h | 7 ++++-- include/uapi/linux/netfilter/nf_tables.h | 2 +- net/netfilter/nf_tables_api.c | 39 +++++++++++++++++++------------- net/netfilter/nft_dynset.c | 3 ++- 4 files changed, 31 insertions(+), 20 deletions(-) (limited to 'include/uapi') diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h index 1e9b5e1659a1..7511918dce6f 100644 --- a/include/net/netfilter/nf_tables.h +++ b/include/net/netfilter/nf_tables.h @@ -832,8 +832,11 @@ static inline struct nft_set_elem_expr *nft_set_ext_expr(const struct nft_set_ex static inline bool __nft_set_elem_expired(const struct nft_set_ext *ext, u64 tstamp) { - return nft_set_ext_exists(ext, NFT_SET_EXT_TIMEOUT) && - time_after_eq64(tstamp, READ_ONCE(nft_set_ext_timeout(ext)->expiration)); + if (!nft_set_ext_exists(ext, NFT_SET_EXT_TIMEOUT) || + nft_set_ext_timeout(ext)->timeout == 0) + return false; + + return time_after_eq64(tstamp, READ_ONCE(nft_set_ext_timeout(ext)->expiration)); } static inline bool nft_set_elem_expired(const struct nft_set_ext *ext) diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h index 639894ed1b97..d6476ca5d7a6 100644 --- a/include/uapi/linux/netfilter/nf_tables.h +++ b/include/uapi/linux/netfilter/nf_tables.h @@ -436,7 +436,7 @@ enum nft_set_elem_flags { * @NFTA_SET_ELEM_KEY: key value (NLA_NESTED: nft_data) * @NFTA_SET_ELEM_DATA: data value of mapping (NLA_NESTED: nft_data_attributes) * @NFTA_SET_ELEM_FLAGS: bitmask of nft_set_elem_flags (NLA_U32) - * @NFTA_SET_ELEM_TIMEOUT: timeout value (NLA_U64) + * @NFTA_SET_ELEM_TIMEOUT: timeout value, zero means never times out (NLA_U64) * @NFTA_SET_ELEM_EXPIRATION: expiration time (NLA_U64) * @NFTA_SET_ELEM_USERDATA: user data (NLA_BINARY) * @NFTA_SET_ELEM_EXPR: expression (NLA_NESTED: nft_expr_attributes) diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index c295d6e6c1fb..ed85b10edb32 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -5815,24 +5815,31 @@ static int nf_tables_fill_setelem(struct sk_buff *skb, goto nla_put_failure; if (nft_set_ext_exists(ext, NFT_SET_EXT_TIMEOUT)) { - u64 expires, now = get_jiffies_64(); + u64 timeout = nft_set_ext_timeout(ext)->timeout; + u64 set_timeout = READ_ONCE(set->timeout); + __be64 msecs = 0; + + if (set_timeout != timeout) { + msecs = nf_jiffies64_to_msecs(timeout); + if (nla_put_be64(skb, NFTA_SET_ELEM_TIMEOUT, msecs, + NFTA_SET_ELEM_PAD)) + goto nla_put_failure; + } - if (nft_set_ext_timeout(ext)->timeout != READ_ONCE(set->timeout) && - nla_put_be64(skb, NFTA_SET_ELEM_TIMEOUT, - nf_jiffies64_to_msecs(nft_set_ext_timeout(ext)->timeout), - NFTA_SET_ELEM_PAD)) - goto nla_put_failure; + if (timeout > 0) { + u64 expires, now = get_jiffies_64(); - expires = READ_ONCE(nft_set_ext_timeout(ext)->expiration); - if (time_before64(now, expires)) - expires -= now; - else - expires = 0; + expires = READ_ONCE(nft_set_ext_timeout(ext)->expiration); + if (time_before64(now, expires)) + expires -= now; + else + expires = 0; - if (nla_put_be64(skb, NFTA_SET_ELEM_EXPIRATION, - nf_jiffies64_to_msecs(expires), - NFTA_SET_ELEM_PAD)) - goto nla_put_failure; + if (nla_put_be64(skb, NFTA_SET_ELEM_EXPIRATION, + nf_jiffies64_to_msecs(expires), + NFTA_SET_ELEM_PAD)) + goto nla_put_failure; + } } if (nft_set_ext_exists(ext, NFT_SET_EXT_USERDATA)) { @@ -7015,7 +7022,7 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set, goto err_parse_key_end; } - 
if (timeout > 0) { + if (set->flags & NFT_SET_TIMEOUT) { err = nft_set_ext_add(&tmpl, NFT_SET_EXT_TIMEOUT); if (err < 0) goto err_parse_key_end; diff --git a/net/netfilter/nft_dynset.c b/net/netfilter/nft_dynset.c index ed8d692bebe3..6a10305de24b 100644 --- a/net/netfilter/nft_dynset.c +++ b/net/netfilter/nft_dynset.c @@ -94,7 +94,8 @@ void nft_dynset_eval(const struct nft_expr *expr, if (set->ops->update(set, ®s->data[priv->sreg_key], nft_dynset_new, expr, regs, &ext)) { if (priv->op == NFT_DYNSET_OP_UPDATE && - nft_set_ext_exists(ext, NFT_SET_EXT_TIMEOUT)) { + nft_set_ext_exists(ext, NFT_SET_EXT_TIMEOUT) && + nft_set_ext_timeout(ext)->timeout != 0) { timeout = priv->timeout ? : READ_ONCE(set->timeout); WRITE_ONCE(nft_set_ext_timeout(ext)->expiration, get_jiffies_64() + timeout); } -- cgit v1.2.3 From 17519819926211e6b2834e00e4554bec0daf22ac Mon Sep 17 00:00:00 2001 From: Joey Gouly Date: Thu, 22 Aug 2024 16:11:03 +0100 Subject: arm64/ptrace: add support for FEAT_POE Add a regset for POE containing POR_EL0. Signed-off-by: Joey Gouly Cc: Catalin Marinas Cc: Will Deacon Reviewed-by: Mark Brown Reviewed-by: Catalin Marinas Reviewed-by: Anshuman Khandual Link: https://lore.kernel.org/r/20240822151113.1479789-21-joey.gouly@arm.com Signed-off-by: Will Deacon --- arch/arm64/kernel/ptrace.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ include/uapi/linux/elf.h | 1 + 2 files changed, 47 insertions(+) (limited to 'include/uapi') diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c index 0d022599eb61..b756578aeaee 100644 --- a/arch/arm64/kernel/ptrace.c +++ b/arch/arm64/kernel/ptrace.c @@ -1440,6 +1440,39 @@ static int tagged_addr_ctrl_set(struct task_struct *target, const struct } #endif +#ifdef CONFIG_ARM64_POE +static int poe_get(struct task_struct *target, + const struct user_regset *regset, + struct membuf to) +{ + if (!system_supports_poe()) + return -EINVAL; + + return membuf_write(&to, &target->thread.por_el0, + sizeof(target->thread.por_el0)); +} + +static int poe_set(struct task_struct *target, const struct + user_regset *regset, unsigned int pos, + unsigned int count, const void *kbuf, const + void __user *ubuf) +{ + int ret; + long ctrl; + + if (!system_supports_poe()) + return -EINVAL; + + ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &ctrl, 0, -1); + if (ret) + return ret; + + target->thread.por_el0 = ctrl; + + return 0; +} +#endif + enum aarch64_regset { REGSET_GPR, REGSET_FPR, @@ -1469,6 +1502,9 @@ enum aarch64_regset { #ifdef CONFIG_ARM64_TAGGED_ADDR_ABI REGSET_TAGGED_ADDR_CTRL, #endif +#ifdef CONFIG_ARM64_POE + REGSET_POE +#endif }; static const struct user_regset aarch64_regsets[] = { @@ -1628,6 +1664,16 @@ static const struct user_regset aarch64_regsets[] = { .set = tagged_addr_ctrl_set, }, #endif +#ifdef CONFIG_ARM64_POE + [REGSET_POE] = { + .core_note_type = NT_ARM_POE, + .n = 1, + .size = sizeof(long), + .align = sizeof(long), + .regset_get = poe_get, + .set = poe_set, + }, +#endif }; static const struct user_regset_view user_aarch64_view = { diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h index b54b313bcf07..81762ff3c99e 100644 --- a/include/uapi/linux/elf.h +++ b/include/uapi/linux/elf.h @@ -441,6 +441,7 @@ typedef struct elf64_shdr { #define NT_ARM_ZA 0x40c /* ARM SME ZA registers */ #define NT_ARM_ZT 0x40d /* ARM SME ZT registers */ #define NT_ARM_FPMR 0x40e /* ARM floating point mode register */ +#define NT_ARM_POE 0x40f /* ARM POE registers */ #define NT_ARC_V2 0x600 /* ARCv2 accumulator/extra registers */ #define NT_VMCOREDD 
0x700 /* Vmcore Device Dump Note */ #define NT_MIPS_DSP 0x800 /* MIPS DSP ASE registers */ -- cgit v1.2.3 From aa16880d9f13c6490e80ad614402c8a6fe6f3efa Mon Sep 17 00:00:00 2001 From: Alexander Mikhalitsyn Date: Tue, 3 Sep 2024 17:16:13 +0200 Subject: fuse: add basic infrastructure to support idmappings Add some preparatory changes in fuse_get_req/fuse_force_creds to handle idmappings. Miklos suggested [1], [2] to change the meaning of the in.h.uid/in.h.gid fields when the daemon declares support for idmapped mounts. Under the new semantics, we fill the uid/gid values in the fuse header with an id-mapped caller uid/gid (for requests which create new inodes); in all other cases we just send -1 to userspace. No functional changes intended. Link: https://lore.kernel.org/all/CAJfpegsVY97_5mHSc06mSw79FehFWtoXT=hhTUK_E-Yhr7OAuQ@mail.gmail.com/ [1] Link: https://lore.kernel.org/all/CAJfpegtHQsEUuFq1k4ZbTD3E1h-GsrN3PWyv7X8cg6sfU_W2Yw@mail.gmail.com/ [2] Signed-off-by: Alexander Mikhalitsyn Signed-off-by: Miklos Szeredi --- fs/fuse/dev.c | 50 +++++++++++++++++++++++++++++++++++------------ fs/fuse/inode.c | 1 + include/uapi/linux/fuse.h | 2 ++ 3 files changed, 41 insertions(+), 12 deletions(-) (limited to 'include/uapi') diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 1b6088288a16..b4092b3b4b5a 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -108,7 +108,9 @@ static void fuse_drop_waiting(struct fuse_conn *fc) static void fuse_put_request(struct fuse_req *req); -static struct fuse_req *fuse_get_req(struct fuse_mount *fm, bool for_background) +static struct fuse_req *fuse_get_req(struct mnt_idmap *idmap, + struct fuse_mount *fm, + bool for_background) { struct fuse_conn *fc = fm->fc; struct fuse_req *req; @@ -140,19 +142,37 @@ static struct fuse_req *fuse_get_req(struct fuse_mount *fm, bool for_background) goto out; } - req->in.h.uid = from_kuid(fc->user_ns, current_fsuid()); - req->in.h.gid = from_kgid(fc->user_ns, current_fsgid()); req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns); __set_bit(FR_WAITING, &req->flags); if (for_background) __set_bit(FR_BACKGROUND, &req->flags); - if (unlikely(req->in.h.uid == ((uid_t)-1) || - req->in.h.gid == ((gid_t)-1))) { - fuse_put_request(req); - return ERR_PTR(-EOVERFLOW); + if ((fm->sb->s_iflags & SB_I_NOIDMAP) || idmap) { + kuid_t idmapped_fsuid; + kgid_t idmapped_fsgid; + + /* + * Note, that when + * (fm->sb->s_iflags & SB_I_NOIDMAP) is true, then + * (idmap == &nop_mnt_idmap) is always true and therefore, + * mapped_fsuid(idmap, fc->user_ns) == current_fsuid(). + */ + idmapped_fsuid = idmap ? mapped_fsuid(idmap, fc->user_ns) : current_fsuid(); + idmapped_fsgid = idmap ?
mapped_fsgid(idmap, fc->user_ns) : current_fsgid(); + req->in.h.uid = from_kuid(fc->user_ns, idmapped_fsuid); + req->in.h.gid = from_kgid(fc->user_ns, idmapped_fsgid); + + if (unlikely(req->in.h.uid == ((uid_t)-1) || + req->in.h.gid == ((gid_t)-1))) { + fuse_put_request(req); + return ERR_PTR(-EOVERFLOW); + } + } else { + req->in.h.uid = FUSE_INVALID_UIDGID; + req->in.h.gid = FUSE_INVALID_UIDGID; } + return req; out: @@ -497,8 +517,14 @@ static void fuse_force_creds(struct fuse_req *req) { struct fuse_conn *fc = req->fm->fc; - req->in.h.uid = from_kuid_munged(fc->user_ns, current_fsuid()); - req->in.h.gid = from_kgid_munged(fc->user_ns, current_fsgid()); + if (req->fm->sb->s_iflags & SB_I_NOIDMAP) { + req->in.h.uid = from_kuid_munged(fc->user_ns, current_fsuid()); + req->in.h.gid = from_kgid_munged(fc->user_ns, current_fsgid()); + } else { + req->in.h.uid = FUSE_INVALID_UIDGID; + req->in.h.gid = FUSE_INVALID_UIDGID; + } + req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns); } @@ -530,7 +556,7 @@ ssize_t fuse_simple_request(struct fuse_mount *fm, struct fuse_args *args) __set_bit(FR_FORCE, &req->flags); } else { WARN_ON(args->nocreds); - req = fuse_get_req(fm, false); + req = fuse_get_req(NULL, fm, false); if (IS_ERR(req)) return PTR_ERR(req); } @@ -591,7 +617,7 @@ int fuse_simple_background(struct fuse_mount *fm, struct fuse_args *args, __set_bit(FR_BACKGROUND, &req->flags); } else { WARN_ON(args->nocreds); - req = fuse_get_req(fm, true); + req = fuse_get_req(NULL, fm, true); if (IS_ERR(req)) return PTR_ERR(req); } @@ -613,7 +639,7 @@ static int fuse_simple_notify_reply(struct fuse_mount *fm, struct fuse_req *req; struct fuse_iqueue *fiq = &fm->fc->iq; - req = fuse_get_req(fm, false); + req = fuse_get_req(NULL, fm, false); if (IS_ERR(req)) return PTR_ERR(req); diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index bebd89002328..ed53e173337b 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -1572,6 +1572,7 @@ static void fuse_sb_defaults(struct super_block *sb) sb->s_time_gran = 1; sb->s_export_op = &fuse_export_operations; sb->s_iflags |= SB_I_IMA_UNVERIFIABLE_SIGNATURE; + sb->s_iflags |= SB_I_NOIDMAP; if (sb->s_user_ns != &init_user_ns) sb->s_iflags |= SB_I_UNTRUSTED_MOUNTER; sb->s_flags &= ~(SB_NOSEC | SB_I_VERSION); diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index d08b99d60f6f..2ccf38181df2 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -984,6 +984,8 @@ struct fuse_fallocate_in { */ #define FUSE_UNIQUE_RESEND (1ULL << 63) +#define FUSE_INVALID_UIDGID ((uint32_t)(-1)) + struct fuse_in_header { uint32_t len; uint32_t opcode; -- cgit v1.2.3 From 16e1503eaf329129170e4e7a078aee17686967a5 Mon Sep 17 00:00:00 2001 From: Alexander Mikhalitsyn Date: Tue, 3 Sep 2024 17:16:25 +0200 Subject: fuse: allow idmapped mounts Now that we have everything in place, we can allow idmapped mounts by setting the FS_ALLOW_IDMAP flag. Notice that the real availability of idmapped mounts will depend on the fuse daemon: the daemon has to set the FUSE_ALLOW_IDMAP flag in the FUSE_INIT reply. To discuss: - we enable idmapped mounts support only if "default_permissions" mode is enabled, because otherwise we would need to deal with UID/GID mappings on the userspace side OR provide userspace with idmapped req->in.h.uid/req->in.h.gid values, which is not something that we want to do. The idmapped mounts philosophy is not about faking the caller uid/gid.
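For context, a hedged daemon-side sketch of the opt-in (field names per include/uapi/linux/fuse.h; since FUSE_ALLOW_IDMAP is bit 40, its upper half lands in flags2 under the FUSE_INIT_EXT convention):

    #include <stdint.h>
    #include <linux/fuse.h>

    /*
     * Sketch only: how a FUSE server could advertise idmapped-mount
     * support in its FUSE_INIT reply. Filling in the rest of
     * fuse_init_out is the daemon's usual business.
     */
    static void advertise_idmap(struct fuse_init_out *out)
    {
            uint64_t flags = FUSE_INIT_EXT | FUSE_ALLOW_IDMAP;

            out->flags = (uint32_t)flags;          /* low 32 bits */
            out->flags2 = (uint32_t)(flags >> 32); /* high 32 bits, incl. bit 40 */
    }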
Some extra links and examples: - libfuse support https://github.com/mihalicyn/libfuse/commits/idmap_support - fuse-overlayfs support: https://github.com/mihalicyn/fuse-overlayfs/commits/idmap_support - cephfs-fuse conversion example https://github.com/mihalicyn/ceph/commits/fuse_idmap - glusterfs conversion example https://github.com/mihalicyn/glusterfs/commits/fuse_idmap Signed-off-by: Alexander Mikhalitsyn Reviewed-by: Christian Brauner Signed-off-by: Miklos Szeredi --- fs/fuse/inode.c | 12 +++++++++--- include/uapi/linux/fuse.h | 20 +++++++++++++++++++- 2 files changed, 28 insertions(+), 4 deletions(-) (limited to 'include/uapi') diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index b08bc6081612..d7edb3fb829f 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -1348,6 +1348,12 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args, } if (flags & FUSE_NO_EXPORT_SUPPORT) fm->sb->s_export_op = &fuse_export_fid_operations; + if (flags & FUSE_ALLOW_IDMAP) { + if (fc->default_permissions) + fm->sb->s_iflags &= ~SB_I_NOIDMAP; + else + ok = false; + } } else { ra_pages = fc->max_read / PAGE_SIZE; fc->no_lock = 1; @@ -1395,7 +1401,7 @@ void fuse_send_init(struct fuse_mount *fm) FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_EXT | FUSE_INIT_EXT | FUSE_SECURITY_CTX | FUSE_CREATE_SUPP_GROUP | FUSE_HAS_EXPIRE_ONLY | FUSE_DIRECT_IO_ALLOW_MMAP | - FUSE_NO_EXPORT_SUPPORT | FUSE_HAS_RESEND; + FUSE_NO_EXPORT_SUPPORT | FUSE_HAS_RESEND | FUSE_ALLOW_IDMAP; #ifdef CONFIG_FUSE_DAX if (fm->fc->dax) flags |= FUSE_MAP_ALIGNMENT; @@ -1985,7 +1991,7 @@ static void fuse_kill_sb_anon(struct super_block *sb) static struct file_system_type fuse_fs_type = { .owner = THIS_MODULE, .name = "fuse", - .fs_flags = FS_HAS_SUBTYPE | FS_USERNS_MOUNT, + .fs_flags = FS_HAS_SUBTYPE | FS_USERNS_MOUNT | FS_ALLOW_IDMAP, .init_fs_context = fuse_init_fs_context, .parameters = fuse_fs_parameters, .kill_sb = fuse_kill_sb_anon, @@ -2006,7 +2012,7 @@ static struct file_system_type fuseblk_fs_type = { .init_fs_context = fuse_init_fs_context, .parameters = fuse_fs_parameters, .kill_sb = fuse_kill_sb_blk, - .fs_flags = FS_REQUIRES_DEV | FS_HAS_SUBTYPE, + .fs_flags = FS_REQUIRES_DEV | FS_HAS_SUBTYPE | FS_ALLOW_IDMAP, }; MODULE_ALIAS_FS("fuseblk"); diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 2ccf38181df2..f1e99458e29e 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -217,6 +217,9 @@ * - add backing_id to fuse_open_out, add FOPEN_PASSTHROUGH open flag * - add FUSE_NO_EXPORT_SUPPORT init flag * - add FUSE_NOTIFY_RESEND, add FUSE_HAS_RESEND init flag + * + * 7.41 + * - add FUSE_ALLOW_IDMAP */ #ifndef _LINUX_FUSE_H @@ -252,7 +255,7 @@ #define FUSE_KERNEL_VERSION 7 /** Minor version number of this interface */ -#define FUSE_KERNEL_MINOR_VERSION 40 +#define FUSE_KERNEL_MINOR_VERSION 41 /** The node ID of the root inode */ #define FUSE_ROOT_ID 1 @@ -421,6 +424,7 @@ struct fuse_file_lock { * FUSE_NO_EXPORT_SUPPORT: explicitly disable export support * FUSE_HAS_RESEND: kernel supports resending pending requests, and the high bit * of the request ID indicates resend requests + * FUSE_ALLOW_IDMAP: allow creation of idmapped mounts */ #define FUSE_ASYNC_READ (1 << 0) #define FUSE_POSIX_LOCKS (1 << 1) @@ -466,6 +470,7 @@ struct fuse_file_lock { /* Obsolete alias for FUSE_DIRECT_IO_ALLOW_MMAP */ #define FUSE_DIRECT_IO_RELAX FUSE_DIRECT_IO_ALLOW_MMAP +#define FUSE_ALLOW_IDMAP (1ULL << 40) /** * CUSE INIT request/reply flags @@ -984,6 +989,19 @@ struct fuse_fallocate_in { */ #define 
FUSE_UNIQUE_RESEND (1ULL << 63) +/** + * This value will be set by the kernel to + * (struct fuse_in_header).{uid,gid} fields in + * case when: + * - fuse daemon enabled FUSE_ALLOW_IDMAP + * - idmapping information is not available and uid/gid + * can not be mapped in accordance with an idmapping. + * + * Note: idmapping information is always available + * for inode creation operations like: + * FUSE_MKNOD, FUSE_SYMLINK, FUSE_MKDIR, FUSE_TMPFILE, + * FUSE_CREATE and FUSE_RENAME2 (with RENAME_WHITEOUT). + */ #define FUSE_INVALID_UIDGID ((uint32_t)(-1)) struct fuse_in_header { -- cgit v1.2.3 From 4e893545ef8712d25f3176790ebb95beb073637e Mon Sep 17 00:00:00 2001 From: Mariusz Tkaczyk Date: Wed, 4 Sep 2024 12:48:47 +0200 Subject: PCI/NPEM: Add Native PCIe Enclosure Management support MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Native PCIe Enclosure Management (NPEM, PCIe r6.1 sec 6.28) allows managing LEDs in storage enclosures. NPEM is indication-oriented and does not give direct access to LEDs. Although each indication *could* represent an individual LED, multiple indications could also be represented as a single, multi-color LED or a single LED blinking at a specific interval. The specification leaves that open. Each enabled indication (capability register bit on) is represented as a ledclass_dev which can be controlled through sysfs. For every ledclass device only 2 brightness states are allowed: LED_ON (1) or LED_OFF (0). This corresponds to the NPEM control register (Indication bit on/off). Ledclass devices appear in sysfs as child devices (a subdirectory) of the PCI device that has an NPEM Extended Capability, for each indication enabled in the NPEM capability register. For example, these are LEDs created for pcieport "10000:02:05.0" on my setup: leds/ ├── 10000:02:05.0:enclosure:fail ├── 10000:02:05.0:enclosure:locate ├── 10000:02:05.0:enclosure:ok └── 10000:02:05.0:enclosure:rebuild They can also be found in the "/sys/class/leds" directory. The parent PCIe device domain/bus/device/function address is used to guarantee uniqueness across the leds subsystem. To enable/disable a "fail" indication, the "brightness" file can be edited: echo 1 > ./leds/10000:02:05.0:enclosure:fail/brightness echo 0 > ./leds/10000:02:05.0:enclosure:fail/brightness PCIe r6.1, sec 7.9.19.2 defines the possible indications. Multiple indications for the same parent PCIe device can conflict, and hardware may update them when processing a new request. To avoid issues, the driver refreshes all indications by reading back the control register. This driver expects to be the exclusive NPEM extended capability manager. It waits up to 1 second after issuing a new request, it doesn't verify whether the controller is busy before writing, and it assumes the mutex lock gives protection from concurrent updates. If _DSM LED management is available, we assume the platform may be using NPEM for its own purposes (see PCI Firmware Spec r3.3 sec 4.7), so the driver does not use NPEM. A future patch will add _DSM support; an info message notes whether NPEM or _DSM is being used. NPEM is a PCIe extended capability so it should be registered in pcie_init_capabilities() but it is not possible due to the LED dependency. The parent pci_device must be added earlier for led_classdev_register() to be successful. NPEM does not require configuration on the kernel side, so it is safe to register LED devices later.
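A small userspace counterpart to the echo example above (the sysfs path mirrors the commit's sample device and is purely illustrative):

    #include <stdio.h>

    /*
     * Illustrative sketch: drive the "locate" indication of the sample
     * device above from C by writing its led class brightness attribute.
     * Only 0 (LED_OFF) and 1 (LED_ON) are meaningful for NPEM LEDs.
     */
    int main(void)
    {
            const char *path =
                    "/sys/class/leds/10000:02:05.0:enclosure:locate/brightness";
            FILE *f = fopen(path, "w");

            if (!f) {
                    perror("fopen");
                    return 1;
            }
            fputs("1\n", f);        /* write "0\n" to turn the indication off */
            fclose(f);
            return 0;
    }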
Link: https://lore.kernel.org/r/20240904104848.23480-3-mariusz.tkaczyk@linux.intel.com Suggested-by: Lukas Wunner Signed-off-by: Mariusz Tkaczyk Signed-off-by: Bjorn Helgaas Tested-by: Stuart Hayes Reviewed-by: Christoph Hellwig Reviewed-by: Ilpo Järvinen --- Documentation/ABI/testing/sysfs-bus-pci | 63 +++++ drivers/pci/Kconfig | 9 + drivers/pci/Makefile | 1 + drivers/pci/npem.c | 423 ++++++++++++++++++++++++++++++++ drivers/pci/pci.h | 8 + drivers/pci/probe.c | 2 + drivers/pci/remove.c | 2 + include/linux/pci.h | 3 + include/uapi/linux/pci_regs.h | 35 +++ 9 files changed, 546 insertions(+) create mode 100644 drivers/pci/npem.c (limited to 'include/uapi') diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci index ecf47559f495..9ddbb1364698 100644 --- a/Documentation/ABI/testing/sysfs-bus-pci +++ b/Documentation/ABI/testing/sysfs-bus-pci @@ -500,3 +500,66 @@ Description: console drivers from the device. Raw users of pci-sysfs resourceN attributes must be terminated prior to resizing. Success of the resizing operation is not guaranteed. + +What: /sys/bus/pci/devices/.../leds/*:enclosure:*/brightness +What: /sys/class/leds/*:enclosure:*/brightness +Date: August 2024 +KernelVersion: 6.12 +Description: + LED indications on PCIe storage enclosures which are controlled + through the NPEM interface (Native PCIe Enclosure Management, + PCIe r6.1 sec 6.28) are accessible as led class devices, both + below /sys/class/leds and below NPEM-capable PCI devices. + + Although these led class devices could be manipulated manually, + in practice they are typically manipulated automatically by an + application such as ledmon(8). + + The name of a led class device is as follows: + <bdf>:enclosure:<indication> + where: + + - <bdf> is the domain, bus, device and function number + (e.g. 10000:02:05.0) + - <indication> is a short description of the LED indication + + Valid indications per PCIe r6.1 table 6-27 are: + + - ok (drive is functioning normally) + - locate (drive is being identified by an admin) + - fail (drive is not functioning properly) + - rebuild (drive is part of an array that is rebuilding) + - pfa (drive is predicted to fail soon) + - hotspare (drive is marked to be used as a replacement) + - ica (drive is part of an array that is degraded) + - ifa (drive is part of an array that is failed) + - idt (drive is not the right type for the connector) + - disabled (drive is disabled, removal is safe) + - specific0 to specific7 (enclosure-specific indications) + + Broadly, the indications fall into one of these categories: + + - to signify drive state (ok, locate, fail, idt, disabled) + - to signify drive role or state in a software RAID array + (rebuild, pfa, hotspare, ica, ifa) + - to signify any other role or state (specific0 to specific7) + + Mandatory indications per PCIe r6.1 sec 7.9.19.2 comprise: + ok, locate, fail, rebuild. All others are optional. + A led class device is only visible if the corresponding + indication is supported by the device. + + To manipulate the indications, write 0 (LED_OFF) or 1 (LED_ON) + to the "brightness" file. Note that manipulating an indication + may implicitly manipulate other indications at the vendor's + discretion. E.g. when the user lights up the "ok" indication, + the vendor may choose to automatically turn off the "fail" + indication. The current state of an indication can be + retrieved by reading its "brightness" file.
+ + The PCIe Base Specification allows vendors leeway to choose + different colors or blinking patterns for the indications, + but they typically follow the IBPI standard. E.g. the "locate" + indication is usually presented as one or two LEDs blinking at + 4 Hz frequency: + https://en.wikipedia.org/wiki/International_Blinking_Pattern_Interpretation diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig index aa4d1833f442..0d94e4a967d8 100644 --- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -143,6 +143,15 @@ config PCI_IOV If unsure, say N. +config PCI_NPEM + bool "Native PCIe Enclosure Management" + depends on LEDS_CLASS=y + help + Support for Native PCIe Enclosure Management. It allows managing LED + indications in storage enclosures. Enclosure must support following + indications: OK, Locate, Fail, Rebuild, other indications are + optional. + config PCI_PRI bool "PCI PRI support" select PCI_ATS diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile index 8ddad57934a6..374c5c06d92f 100644 --- a/drivers/pci/Makefile +++ b/drivers/pci/Makefile @@ -35,6 +35,7 @@ obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o obj-$(CONFIG_VGA_ARB) += vgaarb.o obj-$(CONFIG_PCI_DOE) += doe.o obj-$(CONFIG_PCI_DYNAMIC_OF_NODES) += of_property.o +obj-$(CONFIG_PCI_NPEM) += npem.o # Endpoint library must be initialized before its users obj-$(CONFIG_PCI_ENDPOINT) += endpoint/ diff --git a/drivers/pci/npem.c b/drivers/pci/npem.c new file mode 100644 index 000000000000..5fe33d145108 --- /dev/null +++ b/drivers/pci/npem.c @@ -0,0 +1,423 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * PCIe Enclosure management driver created for LED interfaces based on + * indications. It says *what indications* blink but does not specify *how* + * they blink - it is hardware defined. + * + * The driver name refers to Native PCIe Enclosure Management. It is + * first indication oriented standard with specification. + * + * Native PCIe Enclosure Management (NPEM) + * PCIe Base Specification r6.1 sec 6.28, 7.9.19 + * + * Copyright (c) 2023-2024 Intel Corporation + * Mariusz Tkaczyk + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "pci.h" + +struct indication { + u32 bit; + const char *name; +}; + +static const struct indication npem_indications[] = { + {PCI_NPEM_IND_OK, "enclosure:ok"}, + {PCI_NPEM_IND_LOCATE, "enclosure:locate"}, + {PCI_NPEM_IND_FAIL, "enclosure:fail"}, + {PCI_NPEM_IND_REBUILD, "enclosure:rebuild"}, + {PCI_NPEM_IND_PFA, "enclosure:pfa"}, + {PCI_NPEM_IND_HOTSPARE, "enclosure:hotspare"}, + {PCI_NPEM_IND_ICA, "enclosure:ica"}, + {PCI_NPEM_IND_IFA, "enclosure:ifa"}, + {PCI_NPEM_IND_IDT, "enclosure:idt"}, + {PCI_NPEM_IND_DISABLED, "enclosure:disabled"}, + {PCI_NPEM_IND_SPEC_0, "enclosure:specific_0"}, + {PCI_NPEM_IND_SPEC_1, "enclosure:specific_1"}, + {PCI_NPEM_IND_SPEC_2, "enclosure:specific_2"}, + {PCI_NPEM_IND_SPEC_3, "enclosure:specific_3"}, + {PCI_NPEM_IND_SPEC_4, "enclosure:specific_4"}, + {PCI_NPEM_IND_SPEC_5, "enclosure:specific_5"}, + {PCI_NPEM_IND_SPEC_6, "enclosure:specific_6"}, + {PCI_NPEM_IND_SPEC_7, "enclosure:specific_7"}, + {0, NULL} +}; + +#define for_each_indication(ind, inds) \ + for (ind = inds; ind->bit; ind++) + +/* + * The driver has internal list of supported indications. Ideally, the driver + * should not touch bits that are not defined and for which LED devices are + * not exposed but in reality, it needs to turn them off. 
+ * + * Otherwise, there will be no possibility to turn off indications turned on by + * other utilities or turned on by default and it leads to bad user experience. + * + * Additionally, it excludes NPEM commands like RESET or ENABLE. + */ +static u32 reg_to_indications(u32 caps, const struct indication *inds) +{ + const struct indication *ind; + u32 supported_indications = 0; + + for_each_indication(ind, inds) + supported_indications |= ind->bit; + + return caps & supported_indications; +} + +/** + * struct npem_led - LED details + * @indication: indication details + * @npem: NPEM device + * @name: LED name + * @led: LED device + */ +struct npem_led { + const struct indication *indication; + struct npem *npem; + char name[LED_MAX_NAME_SIZE]; + struct led_classdev led; +}; + +/** + * struct npem_ops - backend specific callbacks + * @get_active_indications: get active indications + * npem: NPEM device + * inds: response buffer + * @set_active_indications: set new indications + * npem: npem device + * inds: bit mask to set + * @inds: supported indications array, set of indications is backend specific + * @name: backend name + */ +struct npem_ops { + int (*get_active_indications)(struct npem *npem, u32 *inds); + int (*set_active_indications)(struct npem *npem, u32 inds); + const struct indication *inds; + const char *name; +}; + +/** + * struct npem - NPEM device properties + * @dev: PCI device this driver is attached to + * @ops: backend specific callbacks + * @lock: serializes concurrent access to NPEM device by multiple LED devices + * @pos: cached offset of NPEM Capability Register in Configuration Space; + * only used if NPEM registers are accessed directly and not through _DSM + * @supported_indications: cached bit mask of supported indications; + * non-indication and reserved bits in the NPEM Capability Register are + * cleared in this bit mask + * @active_indications: cached bit mask of active indications; + * non-indication and reserved bits in the NPEM Control Register are + * cleared in this bit mask + * @led_cnt: size of @leds array + * @leds: array containing LED class devices of all supported LEDs + */ +struct npem { + struct pci_dev *dev; + const struct npem_ops *ops; + struct mutex lock; + u16 pos; + u32 supported_indications; + u32 active_indications; + int led_cnt; + struct npem_led leds[]; +}; + +static int npem_read_reg(struct npem *npem, u16 reg, u32 *val) +{ + int ret = pci_read_config_dword(npem->dev, npem->pos + reg, val); + + return pcibios_err_to_errno(ret); +} + +static int npem_write_ctrl(struct npem *npem, u32 reg) +{ + int pos = npem->pos + PCI_NPEM_CTRL; + int ret = pci_write_config_dword(npem->dev, pos, reg); + + return pcibios_err_to_errno(ret); +} + +static int npem_get_active_indications(struct npem *npem, u32 *inds) +{ + u32 ctrl; + int ret; + + ret = npem_read_reg(npem, PCI_NPEM_CTRL, &ctrl); + if (ret) + return ret; + + /* If PCI_NPEM_CTRL_ENABLE is not set then no indication should blink */ + if (!(ctrl & PCI_NPEM_CTRL_ENABLE)) { + *inds = 0; + return 0; + } + + *inds = ctrl & npem->supported_indications; + + return 0; +} + +static int npem_set_active_indications(struct npem *npem, u32 inds) +{ + int ctrl, ret, ret_val; + u32 cc_status; + + lockdep_assert_held(&npem->lock); + + /* This bit is always required */ + ctrl = inds | PCI_NPEM_CTRL_ENABLE; + + ret = npem_write_ctrl(npem, ctrl); + if (ret) + return ret; + + /* + * For the case where a NPEM command has not completed immediately, + * it is recommended that software not continuously "spin" on 
polling + * the status register, but rather poll under interrupt at a reduced + * rate; for example at 10 ms intervals. + * + * PCIe r6.1 sec 6.28 "Implementation Note: Software Polling of NPEM + * Command Completed" + */ + ret = read_poll_timeout(npem_read_reg, ret_val, + ret_val || (cc_status & PCI_NPEM_STATUS_CC), + 10 * USEC_PER_MSEC, USEC_PER_SEC, false, npem, + PCI_NPEM_STATUS, &cc_status); + if (ret) + return ret; + if (ret_val) + return ret_val; + + /* + * All writes to control register, including writes that do not change + * the register value, are NPEM commands and should eventually result + * in a command completion indication in the NPEM Status Register. + * + * PCIe Base Specification r6.1 sec 7.9.19.3 + * + * Register may not be updated, or other conflicting bits may be + * cleared. Spec is not strict here. Read NPEM Control register after + * write to keep cache in-sync. + */ + return npem_get_active_indications(npem, &npem->active_indications); +} + +static const struct npem_ops npem_ops = { + .get_active_indications = npem_get_active_indications, + .set_active_indications = npem_set_active_indications, + .name = "Native PCIe Enclosure Management", + .inds = npem_indications, +}; + +#define DSM_GUID GUID_INIT(0x5d524d9d, 0xfff9, 0x4d4b, 0x8c, 0xb7, 0x74, 0x7e,\ + 0xd5, 0x1e, 0x19, 0x4d) +#define GET_SUPPORTED_STATES_DSM 1 +#define GET_STATE_DSM 2 +#define SET_STATE_DSM 3 + +static const guid_t dsm_guid = DSM_GUID; + +static bool npem_has_dsm(struct pci_dev *pdev) +{ + acpi_handle handle; + + handle = ACPI_HANDLE(&pdev->dev); + if (!handle) + return false; + + return acpi_check_dsm(handle, &dsm_guid, 0x1, + BIT(GET_SUPPORTED_STATES_DSM) | + BIT(GET_STATE_DSM) | BIT(SET_STATE_DSM)); +} + +/* + * The status of each indicator is cached on first brightness_ get/set time + * and updated at write time. brightness_get() is only responsible for + * reflecting the last written/cached value. 
+ */ +static enum led_brightness brightness_get(struct led_classdev *led) +{ + struct npem_led *nled = container_of(led, struct npem_led, led); + struct npem *npem = nled->npem; + int ret, val = 0; + + ret = mutex_lock_interruptible(&npem->lock); + if (ret) + return ret; + + if (npem->active_indications & nled->indication->bit) + val = 1; + + mutex_unlock(&npem->lock); + return val; +} + +static int brightness_set(struct led_classdev *led, + enum led_brightness brightness) +{ + struct npem_led *nled = container_of(led, struct npem_led, led); + struct npem *npem = nled->npem; + u32 indications; + int ret; + + ret = mutex_lock_interruptible(&npem->lock); + if (ret) + return ret; + + if (brightness == 0) + indications = npem->active_indications & ~(nled->indication->bit); + else + indications = npem->active_indications | nled->indication->bit; + + ret = npem->ops->set_active_indications(npem, indications); + + mutex_unlock(&npem->lock); + return ret; +} + +static void npem_free(struct npem *npem) +{ + struct npem_led *nled; + int cnt; + + if (!npem) + return; + + for (cnt = 0; cnt < npem->led_cnt; cnt++) { + nled = &npem->leds[cnt]; + + if (nled->name[0]) + led_classdev_unregister(&nled->led); + } + + mutex_destroy(&npem->lock); + kfree(npem); +} + +static int pci_npem_set_led_classdev(struct npem *npem, struct npem_led *nled) +{ + struct led_classdev *led = &nled->led; + struct led_init_data init_data = {}; + char *name = nled->name; + int ret; + + init_data.devicename = pci_name(npem->dev); + init_data.default_label = nled->indication->name; + + ret = led_compose_name(&npem->dev->dev, &init_data, name); + if (ret) + return ret; + + led->name = name; + led->brightness_set_blocking = brightness_set; + led->brightness_get = brightness_get; + led->max_brightness = 1; + led->default_trigger = "none"; + led->flags = 0; + + ret = led_classdev_register(&npem->dev->dev, led); + if (ret) + /* Clear the name to indicate that it is not registered. */ + name[0] = 0; + return ret; +} + +static int pci_npem_init(struct pci_dev *dev, const struct npem_ops *ops, + int pos, u32 caps) +{ + u32 supported = reg_to_indications(caps, ops->inds); + int supported_cnt = hweight32(supported); + const struct indication *indication; + struct npem_led *nled; + struct npem *npem; + int led_idx = 0; + int ret; + + npem = kzalloc(struct_size(npem, leds, supported_cnt), GFP_KERNEL); + if (!npem) + return -ENOMEM; + + npem->supported_indications = supported; + npem->led_cnt = supported_cnt; + npem->pos = pos; + npem->dev = dev; + npem->ops = ops; + + ret = npem->ops->get_active_indications(npem, + &npem->active_indications); + if (ret) + return ret; + + mutex_init(&npem->lock); + + for_each_indication(indication, npem_indications) { + if (!(npem->supported_indications & indication->bit)) + continue; + + nled = &npem->leds[led_idx++]; + nled->indication = indication; + nled->npem = npem; + + ret = pci_npem_set_led_classdev(npem, nled); + if (ret) { + npem_free(npem); + return ret; + } + } + + dev->npem = npem; + return 0; +} + +void pci_npem_remove(struct pci_dev *dev) +{ + npem_free(dev->npem); +} + +void pci_npem_create(struct pci_dev *dev) +{ + const struct npem_ops *ops = &npem_ops; + int pos = 0, ret; + u32 cap; + + if (npem_has_dsm(dev)) { + /* + * OS should use the DSM for LED control if it is available + * PCI Firmware Spec r3.3 sec 4.7. 
+ */
+		pci_info(dev, "Not configuring %s because _DSM is present\n",
+			 ops->name);
+		return;
+	} else {
+		pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_NPEM);
+		if (pos == 0)
+			return;
+
+		if (pci_read_config_dword(dev, pos + PCI_NPEM_CAP, &cap) != 0 ||
+		    (cap & PCI_NPEM_CAP_CAPABLE) == 0)
+			return;
+	}
+
+	pci_info(dev, "Configuring %s\n", ops->name);
+
+	ret = pci_npem_init(dev, ops, pos, cap);
+	if (ret)
+		pci_err(dev, "Failed to register %s, err: %d\n", ops->name,
+			ret);
+}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 79c8398f3938..554fd9dfe25c 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -398,6 +398,14 @@ static inline void pci_doe_destroy(struct pci_dev *pdev) { }
 static inline void pci_doe_disconnected(struct pci_dev *pdev) { }
 #endif
 
+#ifdef CONFIG_PCI_NPEM
+void pci_npem_create(struct pci_dev *dev);
+void pci_npem_remove(struct pci_dev *dev);
+#else
+static inline void pci_npem_create(struct pci_dev *dev) { }
+static inline void pci_npem_remove(struct pci_dev *dev) { }
+#endif
+
 /**
  * pci_dev_set_io_state - Set the new error state if possible.
  *
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index b14b9876c030..17ee559c31a4 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2593,6 +2593,8 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
 	dev->match_driver = false;
 	ret = device_add(&dev->dev);
 	WARN_ON(ret < 0);
+
+	pci_npem_create(dev);
 }
 
 struct pci_dev *pci_scan_single_device(struct pci_bus *bus, int devfn)
diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
index 910387e5bdbf..da9629c3a688 100644
--- a/drivers/pci/remove.c
+++ b/drivers/pci/remove.c
@@ -34,6 +34,8 @@ static void pci_destroy_dev(struct pci_dev *dev)
 	if (!dev->dev.kobj.parent)
 		return;
 
+	pci_npem_remove(dev);
+
 	device_del(&dev->dev);
 
 	down_write(&pci_bus_sem);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 4cf89a4b4cbc..c9db853e269f 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -516,6 +516,9 @@ struct pci_dev {
 #endif
 #ifdef CONFIG_PCI_DOE
 	struct xarray	doe_mbs;	/* Data Object Exchange mailboxes */
+#endif
+#ifdef CONFIG_PCI_NPEM
+	struct npem	*npem;		/* Native PCIe Enclosure Management */
 #endif
 	u16		acs_cap;	/* ACS Capability offset */
 	phys_addr_t	rom;		/* Physical address if not from BAR */
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index 94c00996e633..9751b413f3d6 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -740,6 +740,7 @@
 #define PCI_EXT_CAP_ID_DVSEC	0x23	/* Designated Vendor-Specific */
 #define PCI_EXT_CAP_ID_DLF	0x25	/* Data Link Feature */
 #define PCI_EXT_CAP_ID_PL_16GT	0x26	/* Physical Layer 16.0 GT/s */
+#define PCI_EXT_CAP_ID_NPEM	0x29	/* Native PCIe Enclosure Management */
 #define PCI_EXT_CAP_ID_PL_32GT	0x2A	/* Physical Layer 32.0 GT/s */
 #define PCI_EXT_CAP_ID_DOE	0x2E	/* Data Object Exchange */
 #define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_DOE
@@ -1121,6 +1122,40 @@
 #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_MASK		0x000000F0
 #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_SHIFT	4
 
+/* Native PCIe Enclosure Management */
+#define PCI_NPEM_CAP		0x04 /* NPEM capability register */
+#define  PCI_NPEM_CAP_CAPABLE		0x00000001 /* NPEM Capable */
+
+#define PCI_NPEM_CTRL		0x08 /* NPEM control register */
+#define  PCI_NPEM_CTRL_ENABLE		0x00000001 /* NPEM Enable */
+
+/*
+ * Native PCIe Enclosure Management indication bits and the Reset command bit
+ * occupy corresponding positions in the capability and control registers.
+ */
+#define  PCI_NPEM_CMD_RESET	0x00000002 /* Reset Command */
+#define  PCI_NPEM_IND_OK	0x00000004 /* OK */
+#define  PCI_NPEM_IND_LOCATE	0x00000008 /* Locate */
+#define  PCI_NPEM_IND_FAIL	0x00000010 /* Fail */
+#define  PCI_NPEM_IND_REBUILD	0x00000020 /* Rebuild */
+#define  PCI_NPEM_IND_PFA	0x00000040 /* Predicted Failure Analysis */
+#define  PCI_NPEM_IND_HOTSPARE	0x00000080 /* Hot Spare */
+#define  PCI_NPEM_IND_ICA	0x00000100 /* In Critical Array */
+#define  PCI_NPEM_IND_IFA	0x00000200 /* In Failed Array */
+#define  PCI_NPEM_IND_IDT	0x00000400 /* Device Type */
+#define  PCI_NPEM_IND_DISABLED	0x00000800 /* Disabled */
+#define  PCI_NPEM_IND_SPEC_0	0x01000000
+#define  PCI_NPEM_IND_SPEC_1	0x02000000
+#define  PCI_NPEM_IND_SPEC_2	0x04000000
+#define  PCI_NPEM_IND_SPEC_3	0x08000000
+#define  PCI_NPEM_IND_SPEC_4	0x10000000
+#define  PCI_NPEM_IND_SPEC_5	0x20000000
+#define  PCI_NPEM_IND_SPEC_6	0x40000000
+#define  PCI_NPEM_IND_SPEC_7	0x80000000
+
+#define PCI_NPEM_STATUS		0x0c /* NPEM status register */
+#define  PCI_NPEM_STATUS_CC	0x00000001 /* Command Completed */
+
 /* Data Object Exchange */
 #define PCI_DOE_CAP		0x04 /* DOE Capabilities Register */
 #define  PCI_DOE_CAP_INT_SUP	0x00000001 /* Interrupt Support */
--
cgit v1.2.3


From 1083d733eb2624837753046924a9248555a1bfbf Mon Sep 17 00:00:00 2001
From: Ido Schimmel
Date: Tue, 3 Sep 2024 16:35:54 +0300
Subject: ipv4: Fix user space build failure due to header change
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

RT_TOS() from include/uapi/linux/in_route.h is defined using IPTOS_TOS_MASK from include/uapi/linux/ip.h. This is problematic for files such as include/net/ip_fib.h that want to use RT_TOS(), since kernel compilation fails unless both header files are included:

In file included from ./include/net/ip_fib.h:25,
                 from ./include/net/route.h:27,
                 from ./include/net/lwtunnel.h:9,
                 from net/core/dst.c:24:
./include/net/ip_fib.h: In function ‘fib_dscp_masked_match’:
./include/uapi/linux/in_route.h:31:32: error: ‘IPTOS_TOS_MASK’ undeclared (first use in this function)
   31 | #define RT_TOS(tos) ((tos)&IPTOS_TOS_MASK)
      |                            ^~~~~~~~~~~~~~
./include/net/ip_fib.h:440:45: note: in expansion of macro ‘RT_TOS’
  440 |  return dscp == inet_dsfield_to_dscp(RT_TOS(fl4->flowi4_tos));

Therefore, the cited commit changed linux/in_route.h to include linux/ip.h. However, as reported by David, this breaks iproute2 compilation due to overlapping definitions between linux/ip.h and /usr/include/netinet/ip.h:

In file included from ../include/uapi/linux/in_route.h:5,
                 from iproute.c:19:
../include/uapi/linux/ip.h:25:9: warning: "IPTOS_TOS" redefined
   25 | #define IPTOS_TOS(tos) ((tos)&IPTOS_TOS_MASK)
      |         ^~~~~~~~~
In file included from iproute.c:17:
/usr/include/netinet/ip.h:222:9: note: this is the location of the previous definition
  222 | #define IPTOS_TOS(tos) ((tos) & IPTOS_TOS_MASK)

Fix by changing include/net/ip_fib.h to include linux/ip.h. Note that usage of RT_TOS() should not spread further in the kernel due to recent work in this area.
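For reference, a minimal sketch of the macro dependency behind the failure (paraphrased from the two UAPI headers):

    /* include/uapi/linux/ip.h */
    #define IPTOS_TOS_MASK	0x1E

    /* include/uapi/linux/in_route.h */
    #define RT_TOS(tos)	((tos) & IPTOS_TOS_MASK)

Any translation unit that expands RT_TOS() must therefore also see linux/ip.h. Doing that include inside the UAPI header in_route.h collided with netinet/ip.h in userspace, so the kernel-internal consumer (ip_fib.h) pulls in linux/ip.h instead.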
Fixes: 1fa3314c14c6 ("ipv4: Centralize TOS matching")
Reported-by: David Ahern
Closes: https://lore.kernel.org/netdev/2f5146ff-507d-4cab-a195-b28c0c9e654e@kernel.org/
Signed-off-by: Ido Schimmel
Reviewed-by: David Ahern
Reviewed-by: Guillaume Nault
Link: https://patch.msgid.link/20240903133554.2807343-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski
---
 include/net/ip_fib.h          | 1 +
 include/uapi/linux/in_route.h | 2 --
 2 files changed, 1 insertion(+), 2 deletions(-)

(limited to 'include/uapi')

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 269ec10f63e4..967e4dc555fa 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -22,6 +22,7 @@
 #include
 #include
 #include
+#include <linux/ip.h>
 #include
 
 struct fib_config {
diff --git a/include/uapi/linux/in_route.h b/include/uapi/linux/in_route.h
index 10bdd7e7107f..0cc2c23b47f8 100644
--- a/include/uapi/linux/in_route.h
+++ b/include/uapi/linux/in_route.h
@@ -2,8 +2,6 @@
 #ifndef _LINUX_IN_ROUTE_H
 #define _LINUX_IN_ROUTE_H
 
-#include <linux/ip.h>
-
 /* IPv4 routing cache flags */
 #define RTCF_DEAD	RTNH_F_DEAD
--
cgit v1.2.3


From b4fef22c2fb97fa204f0c99c7c7f1c6b422ef0aa Mon Sep 17 00:00:00 2001
From: Aleksa Sarai
Date: Wed, 28 Aug 2024 20:19:42 +1000
Subject: uapi: explain how per-syscall AT_* flags should be allocated

Unfortunately, the way we have gone about adding new AT_* flags has been a little messy. In the beginning, all of the AT_* flags had generic meanings and so it made sense to share the flag bits indiscriminately. However, we inevitably ran into syscalls that needed their own syscall-specific flags. Due to the lack of a planned-out policy, we ended up with the following situations:

 * Existing syscalls adding new features tended to use new AT_* bits, with some effort taken to try to re-use bits for flags that were so obviously syscall-specific that they only make sense for a single syscall (such as the AT_EACCESS/AT_REMOVEDIR/AT_HANDLE_FID triplet). Given the constraints of bitflags, this works well in practice, but ideally (to avoid future confusion) we would plan ahead and define a set of "per-syscall bits" ahead of time so that when allocating new bits we don't end up with a complete mish-mash of which bits are supposed to be per-syscall and which aren't.

 * New syscalls dealt with this in several ways:

   - Some syscalls (like renameat2(2), move_mount(2), fsopen(2), and fspick(2)) created their own separate flag spaces that have no overlap with the AT_* flags. Most of these ended up allocating their bits sequentially. In the case of move_mount(2) and fspick(2), several flags have identical meanings to AT_* flags but were allocated in their own flag space. This makes sense for syscalls that will never share AT_* flags, but for some syscalls this leads to duplication with AT_* flags in a way that could cause confusion (if renameat2(2) grew a RENAME_EMPTY_PATH it seems likely that users could mistake it for AT_EMPTY_PATH since it is an *at(2) syscall).

   - Some syscalls unfortunately ended up both creating their own flag space while also using bits from other flag spaces. The most obvious example is open_tree(2), where the standard usage ends up using flags from *THREE* separate flag spaces:

       open_tree(AT_FDCWD, "/foo", OPEN_TREE_CLONE|O_CLOEXEC|AT_RECURSIVE);

     (Note that O_CLOEXEC is also platform-specific, so several future OPEN_TREE_* bits are also made unusable in one fell swoop.)

It's not entirely clear to me what the "right" choice is for new syscalls. Just saying that all future VFS syscalls should use AT_* flags doesn't seem practical.
openat2(2) has RESOLVE_* flags (many of which don't make much sense to burn generic AT_* flags for) and move_mount(2) has separate AT_*-like flags for both the source and target, so separate flags are needed anyway (though it seems possible that renameat2(2) could grow *_EMPTY_PATH flags at some point, and it's a bit of a shame they can't be reused).

But at least for syscalls that _do_ choose to use AT_* flags, we should explicitly state the policy that 0x2ff is currently intended for per-syscall flags and that new flags should err on the side of overlapping with existing flag bits (so we can extend the scope of generic flags in the future if necessary). And add AT_* aliases for the RENAME_* flags to further cement that renameat2(2) is an *at(2) syscall, just with its own per-syscall flags.

Suggested-by: Amir Goldstein
Reviewed-by: Jeff Layton
Reviewed-by: Josef Bacik
Signed-off-by: Aleksa Sarai
Link: https://lore.kernel.org/r/20240828-exportfs-u64-mount-id-v3-1-10c2c4c16708@cyphar.com
Reviewed-by: Jan Kara
Signed-off-by: Christian Brauner
---
 include/uapi/linux/fcntl.h | 80 ++++++++++++++++++++++++++++++++--------------
 1 file changed, 56 insertions(+), 24 deletions(-)

(limited to 'include/uapi')

diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index e55a3314bcb0..38a6d66d9e88 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -90,37 +90,69 @@
 #define DN_ATTRIB	0x00000020	/* File changed attibutes */
 #define DN_MULTISHOT	0x80000000	/* Don't remove notifier */
 
+#define AT_FDCWD		-100    /* Special value for dirfd used to
+					   indicate openat should use the
+					   current working directory. */
+
+
+/* Generic flags for the *at(2) family of syscalls. */
+
+/* Reserved for per-syscall flags	0xff. */
+#define AT_SYMLINK_NOFOLLOW	0x100   /* Do not follow symbolic
+					   links. */
+/* Reserved for per-syscall flags	0x200 */
+#define AT_SYMLINK_FOLLOW	0x400   /* Follow symbolic links. */
+#define AT_NO_AUTOMOUNT		0x800	/* Suppress terminal automount
+					   traversal. */
+#define AT_EMPTY_PATH		0x1000	/* Allow empty relative
+					   pathname to operate on dirfd
+					   directly. */
 /*
- * The constants AT_REMOVEDIR and AT_EACCESS have the same value. AT_EACCESS is
- * meaningful only to faccessat, while AT_REMOVEDIR is meaningful only to
- * unlinkat. The two functions do completely different things and therefore,
- * the flags can be allowed to overlap. For example, passing AT_REMOVEDIR to
- * faccessat would be undefined behavior and thus treating it equivalent to
- * AT_EACCESS is valid undefined behavior.
+ * These flags are currently statx(2)-specific, but they could be made generic
+ * in the future and so they should not be used for other per-syscall flags.
  */
-#define AT_FDCWD		-100    /* Special value used to indicate
-					   openat should use the current
-					   working directory. */
-#define AT_SYMLINK_NOFOLLOW	0x100   /* Do not follow symbolic links. */
+#define AT_STATX_SYNC_TYPE	0x6000	/* Type of synchronisation required from statx() */
+#define AT_STATX_SYNC_AS_STAT	0x0000	/* - Do whatever stat() does */
+#define AT_STATX_FORCE_SYNC	0x2000	/* - Force the attributes to be sync'd with the server */
+#define AT_STATX_DONT_SYNC	0x4000	/* - Don't sync attributes with the server */
+
+#define AT_RECURSIVE		0x8000	/* Apply to the entire subtree */
+
+/*
+ * Per-syscall flags for the *at(2) family of syscalls.
+ *
+ * These are flags that are so syscall-specific that a user passing these flags
+ * to the wrong syscall is so "clearly wrong" that we can safely call such
+ * usage "undefined behaviour".
+ * + * For example, the constants AT_REMOVEDIR and AT_EACCESS have the same value. + * AT_EACCESS is meaningful only to faccessat, while AT_REMOVEDIR is meaningful + * only to unlinkat. The two functions do completely different things and + * therefore, the flags can be allowed to overlap. For example, passing + * AT_REMOVEDIR to faccessat would be undefined behavior and thus treating it + * equivalent to AT_EACCESS is valid undefined behavior. + * + * Note for implementers: When picking a new per-syscall AT_* flag, try to + * reuse already existing flags first. This leaves us with as many unused bits + * as possible, so we can use them for generic bits in the future if necessary. + */ + +/* Flags for renameat2(2) (must match legacy RENAME_* flags). */ +#define AT_RENAME_NOREPLACE 0x0001 +#define AT_RENAME_EXCHANGE 0x0002 +#define AT_RENAME_WHITEOUT 0x0004 + +/* Flag for faccessat(2). */ #define AT_EACCESS 0x200 /* Test access permitted for effective IDs, not real IDs. */ +/* Flag for unlinkat(2). */ #define AT_REMOVEDIR 0x200 /* Remove directory instead of unlinking file. */ -#define AT_SYMLINK_FOLLOW 0x400 /* Follow symbolic links. */ -#define AT_NO_AUTOMOUNT 0x800 /* Suppress terminal automount traversal */ -#define AT_EMPTY_PATH 0x1000 /* Allow empty relative pathname */ - -#define AT_STATX_SYNC_TYPE 0x6000 /* Type of synchronisation required from statx() */ -#define AT_STATX_SYNC_AS_STAT 0x0000 /* - Do whatever stat() does */ -#define AT_STATX_FORCE_SYNC 0x2000 /* - Force the attributes to be sync'd with the server */ -#define AT_STATX_DONT_SYNC 0x4000 /* - Don't sync attributes with the server */ - -#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */ +/* Flags for name_to_handle_at(2). */ +#define AT_HANDLE_FID 0x200 /* File handle is needed to compare + object identity and may not be + usable with open_by_handle_at(2). */ -/* Flags for name_to_handle_at(2). We reuse AT_ flag space to save bits... */ -#define AT_HANDLE_FID AT_REMOVEDIR /* file handle is needed to - compare object identity and may not - be usable to open_by_handle_at(2) */ #if defined(__KERNEL__) #define AT_GETATTR_NOSEC 0x80000000 #endif -- cgit v1.2.3 From 4356d575ef0f39a3e8e0ce0c40d84ce900ac3b61 Mon Sep 17 00:00:00 2001 From: Aleksa Sarai Date: Wed, 28 Aug 2024 20:19:43 +1000 Subject: fhandle: expose u64 mount id to name_to_handle_at(2) Now that we provide a unique 64-bit mount ID interface in statx(2), we can now provide a race-free way for name_to_handle_at(2) to provide a file handle and corresponding mount without needing to worry about racing with /proc/mountinfo parsing or having to open a file just to do statx(2). 
While this is not necessary if you are using AT_EMPTY_PATH and don't care about an extra statx(2) call, users that pass full paths into name_to_handle_at(2) need to know which mount the file handle comes from (to make sure they don't try to open_by_handle_at a file handle from a different filesystem) and switching to AT_EMPTY_PATH would require allocating a file for every name_to_handle_at(2) call, turning err = name_to_handle_at(-EBADF, "/foo/bar/baz", &handle, &mntid, AT_HANDLE_MNT_ID_UNIQUE); into int fd = openat(-EBADF, "/foo/bar/baz", O_PATH | O_CLOEXEC); err1 = name_to_handle_at(fd, "", &handle, &unused_mntid, AT_EMPTY_PATH); err2 = statx(fd, "", AT_EMPTY_PATH, STATX_MNT_ID_UNIQUE, &statxbuf); mntid = statxbuf.stx_mnt_id; close(fd); Reviewed-by: Jeff Layton Signed-off-by: Aleksa Sarai Link: https://lore.kernel.org/r/20240828-exportfs-u64-mount-id-v3-2-10c2c4c16708@cyphar.com Reviewed-by: Jan Kara Reviewed-by: Josef Bacik Signed-off-by: Christian Brauner --- fs/fhandle.c | 29 ++++++++++++++++++++++------- include/linux/syscalls.h | 2 +- include/uapi/linux/fcntl.h | 1 + 3 files changed, 24 insertions(+), 8 deletions(-) (limited to 'include/uapi') diff --git a/fs/fhandle.c b/fs/fhandle.c index 6e8cea16790e..8cb665629f4a 100644 --- a/fs/fhandle.c +++ b/fs/fhandle.c @@ -16,7 +16,8 @@ static long do_sys_name_to_handle(const struct path *path, struct file_handle __user *ufh, - int __user *mnt_id, int fh_flags) + void __user *mnt_id, bool unique_mntid, + int fh_flags) { long retval; struct file_handle f_handle; @@ -69,9 +70,19 @@ static long do_sys_name_to_handle(const struct path *path, } else retval = 0; /* copy the mount id */ - if (put_user(real_mount(path->mnt)->mnt_id, mnt_id) || - copy_to_user(ufh, handle, - struct_size(handle, f_handle, handle_bytes))) + if (unique_mntid) { + if (put_user(real_mount(path->mnt)->mnt_id_unique, + (u64 __user *) mnt_id)) + retval = -EFAULT; + } else { + if (put_user(real_mount(path->mnt)->mnt_id, + (int __user *) mnt_id)) + retval = -EFAULT; + } + /* copy the handle */ + if (retval != -EFAULT && + copy_to_user(ufh, handle, + struct_size(handle, f_handle, handle_bytes))) retval = -EFAULT; kfree(handle); return retval; @@ -83,6 +94,7 @@ static long do_sys_name_to_handle(const struct path *path, * @name: name that should be converted to handle. * @handle: resulting file handle * @mnt_id: mount id of the file system containing the file + * (u64 if AT_HANDLE_MNT_ID_UNIQUE, otherwise int) * @flag: flag value to indicate whether to follow symlink or not * and whether a decodable file handle is required. * @@ -92,7 +104,7 @@ static long do_sys_name_to_handle(const struct path *path, * value required. */ SYSCALL_DEFINE5(name_to_handle_at, int, dfd, const char __user *, name, - struct file_handle __user *, handle, int __user *, mnt_id, + struct file_handle __user *, handle, void __user *, mnt_id, int, flag) { struct path path; @@ -100,7 +112,8 @@ SYSCALL_DEFINE5(name_to_handle_at, int, dfd, const char __user *, name, int fh_flags; int err; - if (flag & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH | AT_HANDLE_FID)) + if (flag & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH | AT_HANDLE_FID | + AT_HANDLE_MNT_ID_UNIQUE)) return -EINVAL; lookup_flags = (flag & AT_SYMLINK_FOLLOW) ? 
LOOKUP_FOLLOW : 0; @@ -109,7 +122,9 @@ SYSCALL_DEFINE5(name_to_handle_at, int, dfd, const char __user *, name, lookup_flags |= LOOKUP_EMPTY; err = user_path_at(dfd, name, lookup_flags, &path); if (!err) { - err = do_sys_name_to_handle(&path, handle, mnt_id, fh_flags); + err = do_sys_name_to_handle(&path, handle, mnt_id, + flag & AT_HANDLE_MNT_ID_UNIQUE, + fh_flags); path_put(&path); } return err; diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 4bcf6754738d..5758104921e6 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -870,7 +870,7 @@ asmlinkage long sys_fanotify_mark(int fanotify_fd, unsigned int flags, #endif asmlinkage long sys_name_to_handle_at(int dfd, const char __user *name, struct file_handle __user *handle, - int __user *mnt_id, int flag); + void __user *mnt_id, int flag); asmlinkage long sys_open_by_handle_at(int mountdirfd, struct file_handle __user *handle, int flags); diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 38a6d66d9e88..87e2dec79fea 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -152,6 +152,7 @@ #define AT_HANDLE_FID 0x200 /* File handle is needed to compare object identity and may not be usable with open_by_handle_at(2). */ +#define AT_HANDLE_MNT_ID_UNIQUE 0x001 /* Return the u64 unique mount ID. */ #if defined(__KERNEL__) #define AT_GETATTR_NOSEC 0x80000000 -- cgit v1.2.3 From 6fe0593bfc3cfab8b4ec7255152a2be40b2e49a3 Mon Sep 17 00:00:00 2001 From: Erling Ljunggren Date: Wed, 28 Sep 2022 13:21:43 +0200 Subject: media: videodev2.h: add V4L2_CAP_EDID Add capability flag to indicate that the device is an EDID-only device. Signed-off-by: Erling Ljunggren Signed-off-by: Hans Verkuil Reviewed-by: Sebastian Fricke Reviewed-by: Ricardo Ribalda Signed-off-by: Mauro Carvalho Chehab --- include/uapi/linux/videodev2.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/videodev2.h b/include/uapi/linux/videodev2.h index 725e86c4bbbd..27239cb64065 100644 --- a/include/uapi/linux/videodev2.h +++ b/include/uapi/linux/videodev2.h @@ -502,6 +502,7 @@ struct v4l2_capability { #define V4L2_CAP_META_CAPTURE 0x00800000 /* Is a metadata capture device */ #define V4L2_CAP_READWRITE 0x01000000 /* read/write systemcalls */ +#define V4L2_CAP_EDID 0x02000000 /* Is an EDID-only device */ #define V4L2_CAP_STREAMING 0x04000000 /* streaming I/O ioctls */ #define V4L2_CAP_META_OUTPUT 0x08000000 /* Is a metadata output device */ -- cgit v1.2.3 From c7a29258737076a7caf9ced6a7d710cce890abe5 Mon Sep 17 00:00:00 2001 From: Hans Verkuil Date: Wed, 12 Aug 2020 14:20:10 +0200 Subject: media: input: serio.h: add SERIO_EXTRON_DA_HD_PLUS Add a new serio ID for the Extron DA HD 4K Plus series of 4K HDMI Distribution Amplifiers. These devices support CEC over the serial port, so a new serio ID is needed to be able to associate the CEC driver. 
Signed-off-by: Hans Verkuil Acked-by: Dmitry Torokhov Signed-off-by: Mauro Carvalho Chehab --- include/uapi/linux/serio.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/serio.h b/include/uapi/linux/serio.h index ed2a96f43ce4..5a2af0942c9f 100644 --- a/include/uapi/linux/serio.h +++ b/include/uapi/linux/serio.h @@ -83,5 +83,6 @@ #define SERIO_PULSE8_CEC 0x40 #define SERIO_RAINSHADOW_CEC 0x41 #define SERIO_FSIA6B 0x42 +#define SERIO_EXTRON_DA_HD_4K_PLUS 0x43 #endif /* _UAPI_SERIO_H */ -- cgit v1.2.3 From e49dacc71ec2621ce4c422cd5605d4d06f7807b0 Mon Sep 17 00:00:00 2001 From: Wouter Verhelst Date: Mon, 12 Aug 2024 15:20:37 +0200 Subject: nbd: implement the WRITE_ZEROES command The NBD protocol defines a message for zeroing out a region of an export Add support to the kernel driver for that message. Signed-off-by: Wouter Verhelst Cc: Eric Blake Reviewed-by: Damien Le Moal Link: https://lore.kernel.org/r/20240812133032.115134-3-w@uter.be Signed-off-by: Jens Axboe --- drivers/block/nbd.c | 8 ++++++++ include/uapi/linux/nbd.h | 5 ++++- 2 files changed, 12 insertions(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c index 4d06472bf112..eee632b84994 100644 --- a/drivers/block/nbd.c +++ b/drivers/block/nbd.c @@ -363,6 +363,8 @@ static int __nbd_set_size(struct nbd_device *nbd, loff_t bytesize, } if (nbd->config->flags & NBD_FLAG_ROTATIONAL) lim.features |= BLK_FEAT_ROTATIONAL; + if (nbd->config->flags & NBD_FLAG_SEND_WRITE_ZEROES) + lim.max_write_zeroes_sectors = UINT_MAX >> SECTOR_SHIFT; lim.logical_block_size = blksize; lim.physical_block_size = blksize; @@ -432,6 +434,8 @@ static u32 req_to_nbd_cmd_type(struct request *req) return NBD_CMD_WRITE; case REQ_OP_READ: return NBD_CMD_READ; + case REQ_OP_WRITE_ZEROES: + return NBD_CMD_WRITE_ZEROES; default: return U32_MAX; } @@ -648,6 +652,8 @@ static blk_status_t nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, if (req->cmd_flags & REQ_FUA) nbd_cmd_flags |= NBD_CMD_FLAG_FUA; + if ((req->cmd_flags & REQ_NOUNMAP) && (type == NBD_CMD_WRITE_ZEROES)) + nbd_cmd_flags |= NBD_CMD_FLAG_NO_HOLE; /* We did a partial send previously, and we at least sent the whole * request struct, so just go and send the rest of the pages in the @@ -1717,6 +1723,8 @@ static int nbd_dbg_flags_show(struct seq_file *s, void *unused) seq_puts(s, "NBD_FLAG_SEND_FUA\n"); if (flags & NBD_FLAG_SEND_TRIM) seq_puts(s, "NBD_FLAG_SEND_TRIM\n"); + if (flags & NBD_FLAG_SEND_WRITE_ZEROES) + seq_puts(s, "NBD_FLAG_SEND_WRITE_ZEROES\n"); return 0; } diff --git a/include/uapi/linux/nbd.h b/include/uapi/linux/nbd.h index d75215f2c675..f1d468acfb25 100644 --- a/include/uapi/linux/nbd.h +++ b/include/uapi/linux/nbd.h @@ -42,8 +42,9 @@ enum { NBD_CMD_WRITE = 1, NBD_CMD_DISC = 2, NBD_CMD_FLUSH = 3, - NBD_CMD_TRIM = 4 + NBD_CMD_TRIM = 4, /* userspace defines additional extension commands */ + NBD_CMD_WRITE_ZEROES = 6, }; /* values for flags field, these are server interaction specific. */ @@ -53,11 +54,13 @@ enum { #define NBD_FLAG_SEND_FUA (1 << 3) /* send FUA (forced unit access) */ #define NBD_FLAG_ROTATIONAL (1 << 4) /* device is rotational */ #define NBD_FLAG_SEND_TRIM (1 << 5) /* send trim/discard */ +#define NBD_FLAG_SEND_WRITE_ZEROES (1 << 6) /* supports WRITE_ZEROES */ /* there is a gap here to match userspace */ #define NBD_FLAG_CAN_MULTI_CONN (1 << 8) /* Server supports multiple connections per export. 
 */
 
 /* values for cmd flags in the upper 16 bits of request type */
 #define NBD_CMD_FLAG_FUA	(1 << 16) /* FUA (forced unit access) op */
+#define NBD_CMD_FLAG_NO_HOLE	(1 << 17) /* Do not punch a hole for WRITE_ZEROES */
 
 /* These are client behavior specific flags. */
 #define NBD_CFLAG_DESTROY_ON_DISCONNECT (1 << 0) /* delete the nbd device on
--
cgit v1.2.3


From 663b0f1e141dc60ce6c09ae6afc5f213b22d13ca Mon Sep 17 00:00:00 2001
From: Philip Yang
Date: Fri, 16 Feb 2024 11:00:10 -0500
Subject: drm/amdkfd: Document and define SVM events message macro

Document how to use the SMI system management interface to enable and receive SVM events. Document the SVM event triggers.

Define SVM event message string format macros that user mode can use with sscanf to parse the events. Add them to the uAPI header file to make it obvious that changing them changes the uAPI in the future.

No functional changes.

Signed-off-by: Philip Yang
Reviewed-by: James Zhu
Signed-off-by: Alex Deucher
---
 drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c |  45 ++++++------
 include/uapi/linux/kfd_ioctl.h              | 100 ++++++++++++++++++++++++----
 2 files changed, 109 insertions(+), 36 deletions(-)

(limited to 'include/uapi')

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
index ea6a8e43bd5b..de8b9abf7afc 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
@@ -235,17 +235,16 @@ void kfd_smi_event_update_gpu_reset(struct kfd_node *dev, bool post_reset,
 	amdgpu_reset_get_desc(reset_context, reset_cause,
 			      sizeof(reset_cause));
 
-	kfd_smi_event_add(0, dev, event, "%x %s\n",
-			  dev->reset_seq_num,
-			  reset_cause);
+	kfd_smi_event_add(0, dev, event, KFD_EVENT_FMT_UPDATE_GPU_RESET(
+			  dev->reset_seq_num, reset_cause));
 }
 
 void kfd_smi_event_update_thermal_throttling(struct kfd_node *dev,
 					     uint64_t throttle_bitmask)
 {
-	kfd_smi_event_add(0, dev, KFD_SMI_EVENT_THERMAL_THROTTLE, "%llx:%llx\n",
+	kfd_smi_event_add(0, dev, KFD_SMI_EVENT_THERMAL_THROTTLE, KFD_EVENT_FMT_THERMAL_THROTTLING(
 			  throttle_bitmask,
-			  amdgpu_dpm_get_thermal_throttling_counter(dev->adev));
+			  amdgpu_dpm_get_thermal_throttling_counter(dev->adev)));
 }
 
 void kfd_smi_event_update_vmfault(struct kfd_node *dev, uint16_t pasid)
@@ -256,8 +255,8 @@ void kfd_smi_event_update_vmfault(struct kfd_node *dev, uint16_t pasid)
 	if (task_info) {
 		/* Report VM faults from user applications, not retry from kernel */
 		if (task_info->pid)
-			kfd_smi_event_add(0, dev, KFD_SMI_EVENT_VMFAULT, "%x:%s\n",
-					  task_info->pid, task_info->task_name);
+			kfd_smi_event_add(0, dev, KFD_SMI_EVENT_VMFAULT, KFD_EVENT_FMT_VMFAULT(
+					  task_info->pid, task_info->task_name));
 		amdgpu_vm_put_task_info(task_info);
 	}
 }
@@ -267,16 +266,16 @@ void kfd_smi_event_page_fault_start(struct kfd_node *node, pid_t pid,
 				    ktime_t ts)
 {
 	kfd_smi_event_add(pid, node, KFD_SMI_EVENT_PAGE_FAULT_START,
-			  "%lld -%d @%lx(%x) %c\n", ktime_to_ns(ts), pid,
-			  address, node->id, write_fault ? 'W' : 'R');
+			  KFD_EVENT_FMT_PAGEFAULT_START(ktime_to_ns(ts), pid,
+			  address, node->id, write_fault ? 'W' : 'R'));
 }
 
 void kfd_smi_event_page_fault_end(struct kfd_node *node, pid_t pid,
 				  unsigned long address, bool migration)
 {
 	kfd_smi_event_add(pid, node, KFD_SMI_EVENT_PAGE_FAULT_END,
-			  "%lld -%d @%lx(%x) %c\n", ktime_get_boottime_ns(),
-			  pid, address, node->id, migration ? 'M' : 'U');
+			  KFD_EVENT_FMT_PAGEFAULT_END(ktime_get_boottime_ns(),
+			  pid, address, node->id, migration ?
'M' : 'U'));
 }
 
 void kfd_smi_event_migration_start(struct kfd_node *node, pid_t pid,
@@ -286,9 +285,9 @@ void kfd_smi_event_migration_start(struct kfd_node *node, pid_t pid,
 				   uint32_t trigger)
 {
 	kfd_smi_event_add(pid, node, KFD_SMI_EVENT_MIGRATE_START,
-			  "%lld -%d @%lx(%lx) %x->%x %x:%x %d\n",
+			  KFD_EVENT_FMT_MIGRATE_START(
 			  ktime_get_boottime_ns(), pid, start, end - start,
-			  from, to, prefetch_loc, preferred_loc, trigger);
+			  from, to, prefetch_loc, preferred_loc, trigger));
 }
 
 void kfd_smi_event_migration_end(struct kfd_node *node, pid_t pid,
@@ -296,24 +295,24 @@ void kfd_smi_event_migration_end(struct kfd_node *node, pid_t pid,
 				 uint32_t from, uint32_t to, uint32_t trigger)
 {
 	kfd_smi_event_add(pid, node, KFD_SMI_EVENT_MIGRATE_END,
-			  "%lld -%d @%lx(%lx) %x->%x %d\n",
+			  KFD_EVENT_FMT_MIGRATE_END(
 			  ktime_get_boottime_ns(), pid, start, end - start,
-			  from, to, trigger);
+			  from, to, trigger));
 }
 
 void kfd_smi_event_queue_eviction(struct kfd_node *node, pid_t pid,
 				  uint32_t trigger)
 {
 	kfd_smi_event_add(pid, node, KFD_SMI_EVENT_QUEUE_EVICTION,
-			  "%lld -%d %x %d\n", ktime_get_boottime_ns(), pid,
-			  node->id, trigger);
+			  KFD_EVENT_FMT_QUEUE_EVICTION(ktime_get_boottime_ns(), pid,
+			  node->id, trigger));
 }
 
 void kfd_smi_event_queue_restore(struct kfd_node *node, pid_t pid)
 {
 	kfd_smi_event_add(pid, node, KFD_SMI_EVENT_QUEUE_RESTORE,
-			  "%lld -%d %x\n", ktime_get_boottime_ns(), pid,
-			  node->id);
+			  KFD_EVENT_FMT_QUEUE_RESTORE(ktime_get_boottime_ns(), pid,
+			  node->id, 0));
 }
 
 void kfd_smi_event_queue_restore_rescheduled(struct mm_struct *mm)
@@ -330,8 +329,8 @@ void kfd_smi_event_queue_restore_rescheduled(struct mm_struct *mm)
 
 			kfd_smi_event_add(p->lead_thread->pid, pdd->dev,
 					  KFD_SMI_EVENT_QUEUE_RESTORE,
-					  "%lld -%d %x %c\n", ktime_get_boottime_ns(),
-					  p->lead_thread->pid, pdd->dev->id, 'R');
+					  KFD_EVENT_FMT_QUEUE_RESTORE(ktime_get_boottime_ns(),
+					  p->lead_thread->pid, pdd->dev->id, 'R'));
 		}
 		kfd_unref_process(p);
 	}
@@ -341,8 +340,8 @@ void kfd_smi_event_unmap_from_gpu(struct kfd_node *node, pid_t pid,
 				  uint32_t trigger)
 {
 	kfd_smi_event_add(pid, node, KFD_SMI_EVENT_UNMAP_FROM_GPU,
-			  "%lld -%d @%lx(%lx) %x %d\n", ktime_get_boottime_ns(),
-			  pid, address, last - address + 1, node->id, trigger);
+			  KFD_EVENT_FMT_UNMAP_FROM_GPU(ktime_get_boottime_ns(),
+			  pid, address, last - address + 1, node->id, trigger));
 }
 
 int kfd_smi_event_open(struct kfd_node *dev, uint32_t *fd)
diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
index 71a7ce5f2d4c..717307d6b5b7 100644
--- a/include/uapi/linux/kfd_ioctl.h
+++ b/include/uapi/linux/kfd_ioctl.h
@@ -540,26 +540,29 @@ enum kfd_smi_event {
 	KFD_SMI_EVENT_ALL_PROCESS = 64
 };
 
+/* The reason for the page migration event */
 enum KFD_MIGRATE_TRIGGERS {
-	KFD_MIGRATE_TRIGGER_PREFETCH,
-	KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU,
-	KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU,
-	KFD_MIGRATE_TRIGGER_TTM_EVICTION
+	KFD_MIGRATE_TRIGGER_PREFETCH,		/* Prefetch to GPU VRAM or system memory */
+	KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU,	/* GPU page fault recover */
+	KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU,	/* CPU page fault recover */
+	KFD_MIGRATE_TRIGGER_TTM_EVICTION	/* TTM eviction */
 };
 
+/* The reason for the user queue eviction event */
 enum KFD_QUEUE_EVICTION_TRIGGERS {
-	KFD_QUEUE_EVICTION_TRIGGER_SVM,
-	KFD_QUEUE_EVICTION_TRIGGER_USERPTR,
-	KFD_QUEUE_EVICTION_TRIGGER_TTM,
-	KFD_QUEUE_EVICTION_TRIGGER_SUSPEND,
-	KFD_QUEUE_EVICTION_CRIU_CHECKPOINT,
-	KFD_QUEUE_EVICTION_CRIU_RESTORE
+	KFD_QUEUE_EVICTION_TRIGGER_SVM,		/* SVM buffer migration */
+	KFD_QUEUE_EVICTION_TRIGGER_USERPTR,	/* userptr movement */
+	KFD_QUEUE_EVICTION_TRIGGER_TTM,		/* TTM move buffer */
+	KFD_QUEUE_EVICTION_TRIGGER_SUSPEND,	/* GPU suspend */
+	KFD_QUEUE_EVICTION_CRIU_CHECKPOINT,	/* CRIU checkpoint */
+	KFD_QUEUE_EVICTION_CRIU_RESTORE		/* CRIU restore */
 };
 
+/* The reason for the unmap-buffer-from-GPU event */
 enum KFD_SVM_UNMAP_TRIGGERS {
-	KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY,
-	KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY_MIGRATE,
-	KFD_SVM_UNMAP_TRIGGER_UNMAP_FROM_CPU
+	KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY,	/* MMU notifier CPU buffer movement */
+	KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY_MIGRATE,/* MMU notifier page migration */
+	KFD_SVM_UNMAP_TRIGGER_UNMAP_FROM_CPU	/* Unmap to free the buffer */
 };
 
 #define KFD_SMI_EVENT_MASK_FROM_INDEX(i) (1ULL << ((i) - 1))
@@ -570,6 +573,77 @@ struct kfd_ioctl_smi_events_args {
 	__u32 anon_fd;	/* from KFD */
 };
 
+/*
+ * SVM event tracing via the SMI system management interface
+ *
+ * Open an event file descriptor:
+ * use the AMDKFD_IOC_SMI_EVENTS ioctl, pass in the gpuid, and get back an
+ * anonymous file descriptor to receive SMI events.
+ * If called with sudo permission, the file descriptor can be used to receive
+ * SVM events from all processes; otherwise it only receives SVM events of the
+ * same process.
+ *
+ * To enable an SVM event:
+ * Write the KFD_SMI_EVENT_MASK_FROM_INDEX(event) bitmap mask to the event
+ * file descriptor to start recording the event to the kfifo; combine bitmap
+ * masks for multiple events. A new event mask overwrites the previous one.
+ * The KFD_SMI_EVENT_MASK_FROM_INDEX(KFD_SMI_EVENT_ALL_PROCESS) bit requires
+ * sudo permission to receive SVM events from all processes.
+ *
+ * To receive an event:
+ * The application can poll the file descriptor to wait for events, then read
+ * events from the file into a buffer. Each event is a one-line string
+ * message, starting with the event id, followed by the event-specific
+ * information.
+ *
+ * To decode event information:
+ * The following event format string macros can be used with sscanf to decode
+ * the event-specific information.
+ * event triggers: the reason the event was generated, defined as enums for
+ * the unmap, eviction and migrate events.
+ * node, from, to, prefetch_loc, preferred_loc: GPU ID, or 0 for system memory.
+ * addr: user mode address, in pages
+ * size: in pages
+ * pid: the process ID that generated the event
+ * ns: timestamp in nanoseconds, starting at system boot time but stopping
+ *     during suspend
+ * migrate_update: a GPU page fault is recovered by 'M' for migrate, 'U' for update
+ * rw: 'W' for a write page fault, 'R' for a read page fault
+ * rescheduled: 'R' if the queue restore failed and was rescheduled to try again
+ */
+#define KFD_EVENT_FMT_UPDATE_GPU_RESET(reset_seq_num, reset_cause)\
+		"%x %s\n", (reset_seq_num), (reset_cause)
+
+#define KFD_EVENT_FMT_THERMAL_THROTTLING(bitmask, counter)\
+		"%llx:%llx\n", (bitmask), (counter)
+
+#define KFD_EVENT_FMT_VMFAULT(pid, task_name)\
+		"%x:%s\n", (pid), (task_name)
+
+#define KFD_EVENT_FMT_PAGEFAULT_START(ns, pid, addr, node, rw)\
+		"%lld -%d @%lx(%x) %c\n", (ns), (pid), (addr), (node), (rw)
+
+#define KFD_EVENT_FMT_PAGEFAULT_END(ns, pid, addr, node, migrate_update)\
+		"%lld -%d @%lx(%x) %c\n", (ns), (pid), (addr), (node), (migrate_update)
+
+#define KFD_EVENT_FMT_MIGRATE_START(ns, pid, start, size, from, to, prefetch_loc,\
+		preferred_loc, migrate_trigger)\
+		"%lld -%d @%lx(%lx) %x->%x %x:%x %d\n", (ns), (pid), (start), (size),\
+		(from), (to), (prefetch_loc), (preferred_loc), (migrate_trigger)
+
+#define KFD_EVENT_FMT_MIGRATE_END(ns, pid, start, size, from, to, migrate_trigger)\
+		"%lld -%d @%lx(%lx) %x->%x %d\n", (ns), (pid), (start), (size),\
+		(from), (to), (migrate_trigger)
+
+#define KFD_EVENT_FMT_QUEUE_EVICTION(ns, pid, node, evict_trigger)\
+		"%lld -%d %x %d\n", (ns), (pid), (node), (evict_trigger)
+
+#define KFD_EVENT_FMT_QUEUE_RESTORE(ns, pid, node, rescheduled)\
+		"%lld -%d %x %c\n", (ns), (pid), (node), (rescheduled)
+
+#define KFD_EVENT_FMT_UNMAP_FROM_GPU(ns, pid, addr, size, node, unmap_trigger)\
+		"%lld -%d @%lx(%lx) %x %d\n", (ns), (pid), (addr), (size),\
+		(node), (unmap_trigger)
+
 /**************************************************************************************************
  * CRIU IOCTLs (Checkpoint Restore In Userspace)
  *
--
cgit v1.2.3


From c259acab839e57eab0318f32da4ae803a8d59397 Mon Sep 17 00:00:00 2001
From: Mahesh Bandewar
Date: Wed, 4 Sep 2024 07:13:05 -0700
Subject: ptp/ioctl: support MONOTONIC{,_RAW} timestamps for PTP_SYS_OFFSET_EXTENDED

The ability to read the PHC (PTP Hardware Clock) alongside multiple system clocks is currently dependent on the specific hardware architecture. This limitation restricts the use of PTP_SYS_OFFSET_PRECISE to certain hardware configurations.

The generic solution, which would work across all architectures, is to read the PHC along with the latency of the PHC read, as offered by PTP_SYS_OFFSET_EXTENDED, which provides pre and post timestamps. However, these timestamps are currently limited to the CLOCK_REALTIME timebase. Since CLOCK_REALTIME is affected by NTP (or similar time synchronization services), it can experience significant jumps forward or backward. This hinders the precise latency measurements that PTP_SYS_OFFSET_EXTENDED is designed to provide.

This problem could be addressed by supporting MONOTONIC_RAW timestamps within PTP_SYS_OFFSET_EXTENDED. Unlike CLOCK_REALTIME or CLOCK_MONOTONIC, the MONOTONIC_RAW timebase is unaffected by NTP adjustments.

This enhancement can be implemented by utilizing one of the three reserved words within the PTP_SYS_OFFSET_EXTENDED struct to pass the clock-id for the timestamps. The current behavior aligns with the clock-id for the CLOCK_REALTIME timebase (value of 0), ensuring backward compatibility of the UAPI.
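To illustrate the intended use from userspace, here is a hedged sketch (assuming a PHC already opened from /dev/ptp0 as fd; field names as in this patch, error handling trimmed):

    struct ptp_sys_offset_extended extoff = {0};
    unsigned int i;

    extoff.n_samples = 5;
    extoff.clockid = CLOCK_MONOTONIC_RAW;  /* 0 (CLOCK_REALTIME) keeps the old behavior */

    if (ioctl(fd, PTP_SYS_OFFSET_EXTENDED, &extoff) < 0)
        return -1;  /* older kernels reject a non-zero reserved word with EINVAL */

    for (i = 0; i < extoff.n_samples; i++) {
        /* extoff.ts[i][0] = pre system ts, [1] = PHC ts, [2] = post system ts */
    }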
Signed-off-by: Mahesh Bandewar Signed-off-by: Vadim Fedorenko Signed-off-by: David S. Miller --- drivers/ptp/ptp_chardev.c | 8 ++++++-- include/linux/ptp_clock_kernel.h | 36 ++++++++++++++++++++++++++++++++---- include/uapi/linux/ptp_clock.h | 24 ++++++++++++++++++------ 3 files changed, 56 insertions(+), 12 deletions(-) (limited to 'include/uapi') diff --git a/drivers/ptp/ptp_chardev.c b/drivers/ptp/ptp_chardev.c index 2067b0120d08..ea96a14d72d1 100644 --- a/drivers/ptp/ptp_chardev.c +++ b/drivers/ptp/ptp_chardev.c @@ -359,11 +359,15 @@ long ptp_ioctl(struct posix_clock_context *pccontext, unsigned int cmd, extoff = NULL; break; } - if (extoff->n_samples > PTP_MAX_SAMPLES - || extoff->rsv[0] || extoff->rsv[1] || extoff->rsv[2]) { + if (extoff->n_samples > PTP_MAX_SAMPLES || + extoff->rsv[0] || extoff->rsv[1] || + (extoff->clockid != CLOCK_REALTIME && + extoff->clockid != CLOCK_MONOTONIC && + extoff->clockid != CLOCK_MONOTONIC_RAW)) { err = -EINVAL; break; } + sts.clockid = extoff->clockid; for (i = 0; i < extoff->n_samples; i++) { err = ptp->info->gettimex64(ptp->info, &ts, &sts); if (err) diff --git a/include/linux/ptp_clock_kernel.h b/include/linux/ptp_clock_kernel.h index 6e4b8206c7d0..c892d22ce0a7 100644 --- a/include/linux/ptp_clock_kernel.h +++ b/include/linux/ptp_clock_kernel.h @@ -47,10 +47,12 @@ struct system_device_crosststamp; * struct ptp_system_timestamp - system time corresponding to a PHC timestamp * @pre_ts: system timestamp before capturing PHC * @post_ts: system timestamp after capturing PHC + * @clockid: clock-base used for capturing the system timestamps */ struct ptp_system_timestamp { struct timespec64 pre_ts; struct timespec64 post_ts; + clockid_t clockid; }; /** @@ -457,14 +459,40 @@ static inline ktime_t ptp_convert_timestamp(const ktime_t *hwtstamp, static inline void ptp_read_system_prets(struct ptp_system_timestamp *sts) { - if (sts) - ktime_get_real_ts64(&sts->pre_ts); + if (sts) { + switch (sts->clockid) { + case CLOCK_REALTIME: + ktime_get_real_ts64(&sts->pre_ts); + break; + case CLOCK_MONOTONIC: + ktime_get_ts64(&sts->pre_ts); + break; + case CLOCK_MONOTONIC_RAW: + ktime_get_raw_ts64(&sts->pre_ts); + break; + default: + break; + } + } } static inline void ptp_read_system_postts(struct ptp_system_timestamp *sts) { - if (sts) - ktime_get_real_ts64(&sts->post_ts); + if (sts) { + switch (sts->clockid) { + case CLOCK_REALTIME: + ktime_get_real_ts64(&sts->post_ts); + break; + case CLOCK_MONOTONIC: + ktime_get_ts64(&sts->post_ts); + break; + case CLOCK_MONOTONIC_RAW: + ktime_get_raw_ts64(&sts->post_ts); + break; + default: + break; + } + } } #endif diff --git a/include/uapi/linux/ptp_clock.h b/include/uapi/linux/ptp_clock.h index 053b40d642de..18eefa6d93d6 100644 --- a/include/uapi/linux/ptp_clock.h +++ b/include/uapi/linux/ptp_clock.h @@ -155,13 +155,25 @@ struct ptp_sys_offset { struct ptp_clock_time ts[2 * PTP_MAX_SAMPLES + 1]; }; +/* + * ptp_sys_offset_extended - data structure for IOCTL operation + * PTP_SYS_OFFSET_EXTENDED + * + * @n_samples: Desired number of measurements. + * @clockid: clockid of a clock-base used for pre/post timestamps. + * @rsv: Reserved for future use. + * @ts: Array of samples in the form [pre-TS, PHC, post-TS]. The + * kernel provides @n_samples. + * + * Starting from kernel 6.12 and onwards, the first word of the reserved-field + * is used for @clockid. 
That's backward compatible since previous kernels expect all three
+ * reserved words (@rsv[3]) to be 0, while the clockid (first word in the new
+ * structure) for CLOCK_REALTIME is '0'.
+ */
 struct ptp_sys_offset_extended {
-	unsigned int n_samples; /* Desired number of measurements. */
-	unsigned int rsv[3];    /* Reserved for future use. */
-	/*
-	 * Array of [system, phc, system] time stamps. The kernel will provide
-	 * 3*n_samples time stamps.
-	 */
+	unsigned int n_samples;
+	__kernel_clockid_t clockid;
+	unsigned int rsv[2];
 	struct ptp_clock_time ts[PTP_MAX_SAMPLES][3];
 };
--
cgit v1.2.3


From 6cf1c97dad2ebc4de03105cc444b3dfaa83f3dc2 Mon Sep 17 00:00:00 2001
From: zhenwei pi
Date: Tue, 23 Apr 2024 11:41:07 +0800
Subject: virtio_balloon: introduce oom-kill invocations

When the guest OS runs under critical memory pressure, the guest starts to kill processes. A guest monitor agent may scan 'oom_kill' from /proc/vmstat and report the OOM KILL event. However, the agent itself may be killed, and we would then lose this critical event (and the later events).

For now we can also grep for magic words in the guest kernel log from the host side. Rather than relying on this unstable approach, have virtio balloon report OOM-KILL invocations instead.

Acked-by: David Hildenbrand
Signed-off-by: zhenwei pi
Message-Id: <20240423034109.1552866-3-pizhenwei@bytedance.com>
Signed-off-by: Michael S. Tsirkin
---
 drivers/virtio/virtio_balloon.c     | 1 +
 include/uapi/linux/virtio_balloon.h | 6 ++++--
 2 files changed, 5 insertions(+), 2 deletions(-)

(limited to 'include/uapi')

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 54469277ca30..b54392366142 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -363,6 +363,7 @@ static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb)
 			    pages_to_bytes(events[PSWPOUT]));
 	update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
 	update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
+	update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL, events[OOM_KILL]);
 
 #ifdef CONFIG_HUGETLB_PAGE
 	update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index ddaa45e723c4..b17bbe033697 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -71,7 +71,8 @@ struct virtio_balloon_config {
 #define VIRTIO_BALLOON_S_CACHES   7   /* Disk caches */
 #define VIRTIO_BALLOON_S_HTLB_PGALLOC  8  /* Hugetlb page allocations */
 #define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Hugetlb page allocation failures */
-#define VIRTIO_BALLOON_S_NR       10
+#define VIRTIO_BALLOON_S_OOM_KILL      10 /* OOM killer invocations */
+#define VIRTIO_BALLOON_S_NR       11
 
 #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \
 	VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \
@@ -83,7 +84,8 @@ struct virtio_balloon_config {
 	VIRTIO_BALLOON_S_NAMES_prefix "available-memory", \
 	VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \
 	VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \
-	VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures" \
+	VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \
+	VIRTIO_BALLOON_S_NAMES_prefix "oom-kills" \
 }
 
 #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("")
--
cgit v1.2.3


From c5b70a26aac39f09a23fd72f44cfbb3d4d5a14d5 Mon Sep 17 00:00:00 2001
From: zhenwei pi
Date: Tue, 23 Apr 2024 11:41:08 +0800
Subject: virtio_balloon: introduce memory allocation stall counter

The memory allocation stall counter represents the
performance/latency of memory allocation; expose this counter to the host side via the virtio balloon device, out of band.

Acked-by: David Hildenbrand
Signed-off-by: zhenwei pi
Message-Id: <20240423034109.1552866-4-pizhenwei@bytedance.com>
Signed-off-by: Michael S. Tsirkin
---
 drivers/virtio/virtio_balloon.c     |  8 ++++++++
 include/uapi/linux/virtio_balloon.h |  6 ++++--
 2 files changed, 12 insertions(+), 2 deletions(-)

(limited to 'include/uapi')

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index b54392366142..6f108b2977de 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -355,6 +355,8 @@ static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb)
 {
 	unsigned long events[NR_VM_EVENT_ITEMS];
 	unsigned int idx = 0;
+	unsigned int zid;
+	unsigned long stall = 0;
 
 	all_vm_events(events);
 	update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_IN,
@@ -365,6 +367,12 @@ static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb)
 	update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
 	update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL, events[OOM_KILL]);
 
+	/* sum all the stall events */
+	for (zid = 0; zid < MAX_NR_ZONES; zid++)
+		stall += events[ALLOCSTALL_NORMAL - ZONE_NORMAL + zid];
+
+	update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall);
+
 #ifdef CONFIG_HUGETLB_PAGE
 	update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
 		    events[HTLB_BUDDY_PGALLOC]);
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index b17bbe033697..487b893a160e 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -72,7 +72,8 @@ struct virtio_balloon_config {
 #define VIRTIO_BALLOON_S_HTLB_PGALLOC  8  /* Hugetlb page allocations */
 #define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Hugetlb page allocation failures */
 #define VIRTIO_BALLOON_S_OOM_KILL      10 /* OOM killer invocations */
-#define VIRTIO_BALLOON_S_NR       11
+#define VIRTIO_BALLOON_S_ALLOC_STALL   11 /* Stall count of memory allocation */
+#define VIRTIO_BALLOON_S_NR       12
 
 #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \
 	VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \
@@ -85,7 +86,8 @@ struct virtio_balloon_config {
 	VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \
 	VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \
 	VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \
-	VIRTIO_BALLOON_S_NAMES_prefix "oom-kills" \
+	VIRTIO_BALLOON_S_NAMES_prefix "oom-kills", \
+	VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls" \
 }
 
 #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("")
--
cgit v1.2.3


From 74c025c5d7e4ac7c7ad269c1ee64da4bdfe4770c Mon Sep 17 00:00:00 2001
From: zhenwei pi
Date: Tue, 23 Apr 2024 11:41:09 +0800
Subject: virtio_balloon: introduce memory scan/reclaim info

Expose memory scan/reclaim information to the host side via the virtio balloon device.
Now we have a metric to analyze memory performance:

y: the counter increases
n: the counter does not change
h: the rate of counter change is high
l: the rate of counter change is low

OOM: VIRTIO_BALLOON_S_OOM_KILL
STALL: VIRTIO_BALLOON_S_ALLOC_STALL
ASCAN: VIRTIO_BALLOON_S_SCAN_ASYNC
DSCAN: VIRTIO_BALLOON_S_SCAN_DIRECT
ARCLM: VIRTIO_BALLOON_S_RECLAIM_ASYNC
DRCLM: VIRTIO_BALLOON_S_RECLAIM_DIRECT

- OOM[y], STALL[*], ASCAN[*], DSCAN[*], ARCLM[*], DRCLM[*]:
  the guest runs under really critical memory pressure

- OOM[n], STALL[h], ASCAN[*], DSCAN[l], ARCLM[*], DRCLM[l]:
  the memory allocation stalls due to cgroup, not global memory pressure.

- OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[h]:
  the memory allocation stalls due to global memory pressure. Performance
  gets hurt a lot. A high DRCLM/DSCAN ratio shows quite effective memory
  reclaiming.

- OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[l]:
  the memory allocation stalls due to global memory pressure. The
  DRCLM/DSCAN ratio gets low and the guest OS is thrashing heavily; in
  serious cases this leads to poor performance and difficult
  troubleshooting. For example, sshd may block on memory allocation when
  accepting new connections, so a user can't log in to the VM over ssh.

- OOM[n], STALL[n], ASCAN[h], DSCAN[n], ARCLM[l], DRCLM[n]:
  the low ARCLM/ASCAN ratio shows that the guest tries to reclaim more
  memory, but it can't. Once more memory is required in the future, it will
  struggle to reclaim memory.

Acked-by: David Hildenbrand
Signed-off-by: zhenwei pi
Message-Id: <20240423034109.1552866-5-pizhenwei@bytedance.com>
Signed-off-by: Michael S. Tsirkin
---
 drivers/virtio/virtio_balloon.c     |  9 +++++++++
 include/uapi/linux/virtio_balloon.h | 12 ++++++++++--
 2 files changed, 19 insertions(+), 2 deletions(-)

(limited to 'include/uapi')

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 6f108b2977de..b36d2803674e 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -373,6 +373,15 @@ static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb)
 
 	update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall);
 
+	update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_SCAN,
+		    pages_to_bytes(events[PGSCAN_KSWAPD]));
+	update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_SCAN,
+		    pages_to_bytes(events[PGSCAN_DIRECT]));
+	update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_RECLAIM,
+		    pages_to_bytes(events[PGSTEAL_KSWAPD]));
+	update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_RECLAIM,
+		    pages_to_bytes(events[PGSTEAL_DIRECT]));
+
 #ifdef CONFIG_HUGETLB_PAGE
 	update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
 		    events[HTLB_BUDDY_PGALLOC]);
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 487b893a160e..ee35a372805d 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -73,7 +73,11 @@ struct virtio_balloon_config {
 #define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Hugetlb page allocation failures */
 #define VIRTIO_BALLOON_S_OOM_KILL      10 /* OOM killer invocations */
 #define VIRTIO_BALLOON_S_ALLOC_STALL   11 /* Stall count of memory allocation */
-#define VIRTIO_BALLOON_S_NR       12
+#define VIRTIO_BALLOON_S_ASYNC_SCAN    12 /* Amount of memory scanned asynchronously */
+#define VIRTIO_BALLOON_S_DIRECT_SCAN   13 /* Amount of memory scanned directly */
+#define VIRTIO_BALLOON_S_ASYNC_RECLAIM 14 /* Amount of memory reclaimed asynchronously */
+#define VIRTIO_BALLOON_S_DIRECT_RECLAIM 15 /* Amount of memory reclaimed directly */
+#define VIRTIO_BALLOON_S_NR	16
 
 #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \
 	VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \
@@ -87,7 +91,11 @@ struct virtio_balloon_config {
 	VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \
 	VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \
 	VIRTIO_BALLOON_S_NAMES_prefix "oom-kills", \
-	VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls" \
+	VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls", \
+	VIRTIO_BALLOON_S_NAMES_prefix "async-scans", \
+	VIRTIO_BALLOON_S_NAMES_prefix "direct-scans", \
+	VIRTIO_BALLOON_S_NAMES_prefix "async-reclaims", \
+	VIRTIO_BALLOON_S_NAMES_prefix "direct-reclaims" \
 }
 
 #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("")
--
cgit v1.2.3


From 2f87e9cf0c9e21ab9be1fb2ba8520a1525359497 Mon Sep 17 00:00:00 2001
From: Cindy Lu
Date: Wed, 31 Jul 2024 11:16:01 +0800
Subject: vdpa: support set mac address from vdpa tool

Add a new UAPI to support setting the MAC address from the vdpa tool. The function vdpa_nl_cmd_dev_attr_set_doit() gets the new MAC address from the vdpa tool and then sets it on the device.

The usage is:
    vdpa dev set name vdpa_name mac **:**:**:**:**:**

Here is an example:
root@L1# vdpa -jp dev config show vdpa0
{
    "config": {
        "vdpa0": {
            "mac": "82:4d:e9:5d:d7:e6",
            "link ": "up",
            "link_announce ": false,
            "mtu": 1500
        }
    }
}

root@L1# vdpa dev set name vdpa0 mac 00:11:22:33:44:55
root@L1# vdpa -jp dev config show vdpa0
{
    "config": {
        "vdpa0": {
            "mac": "00:11:22:33:44:55",
            "link ": "up",
            "link_announce ": false,
            "mtu": 1500
        }
    }
}

Signed-off-by: Cindy Lu
Message-Id: <20240731031653.1047692-2-lulu@redhat.com>
Signed-off-by: Michael S. Tsirkin
Acked-by: Jason Wang
---
 drivers/vdpa/vdpa.c       | 79 +++++++++++++++++++++++++++++++++++++++++
 include/linux/vdpa.h      |  9 ++++++
 include/uapi/linux/vdpa.h |  1 +
 3 files changed, 89 insertions(+)

(limited to 'include/uapi')

diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
index 4dbd2e55a288..8a372b51c21a 100644
--- a/drivers/vdpa/vdpa.c
+++ b/drivers/vdpa/vdpa.c
@@ -1361,6 +1361,80 @@ dev_err:
 	return err;
 }
 
+static int vdpa_dev_net_device_attr_set(struct vdpa_device *vdev,
+					struct genl_info *info)
+{
+	struct vdpa_dev_set_config set_config = {};
+	struct vdpa_mgmt_dev *mdev = vdev->mdev;
+	struct nlattr **nl_attrs = info->attrs;
+	const u8 *macaddr;
+	int err = -EOPNOTSUPP;
+
+	down_write(&vdev->cf_lock);
+	if (nl_attrs[VDPA_ATTR_DEV_NET_CFG_MACADDR]) {
+		set_config.mask |= BIT_ULL(VDPA_ATTR_DEV_NET_CFG_MACADDR);
+		macaddr = nla_data(nl_attrs[VDPA_ATTR_DEV_NET_CFG_MACADDR]);
+
+		if (is_valid_ether_addr(macaddr)) {
+			ether_addr_copy(set_config.net.mac, macaddr);
+			if (mdev->ops->dev_set_attr) {
+				err = mdev->ops->dev_set_attr(mdev, vdev,
+							      &set_config);
+			} else {
+				NL_SET_ERR_MSG_FMT_MOD(info->extack,
+						       "Operation not supported by the device.");
+			}
+		} else {
+			NL_SET_ERR_MSG_FMT_MOD(info->extack,
+					       "Invalid MAC address");
+		}
+	}
+	up_write(&vdev->cf_lock);
+	return err;
+}
+
+static int vdpa_nl_cmd_dev_attr_set_doit(struct sk_buff *skb,
+					 struct genl_info *info)
+{
+	struct vdpa_device *vdev;
+	struct device *dev;
+	const char *name;
+	u64 classes;
+	int err = 0;
+
+	if (!info->attrs[VDPA_ATTR_DEV_NAME])
+		return -EINVAL;
+
+	name = nla_data(info->attrs[VDPA_ATTR_DEV_NAME]);
+
+	down_write(&vdpa_dev_lock);
+	dev = bus_find_device(&vdpa_bus, NULL, name, vdpa_name_match);
+	if (!dev) {
+		NL_SET_ERR_MSG_MOD(info->extack, "device not found");
+		err = -ENODEV;
+		goto dev_err;
+	}
+	vdev = container_of(dev, struct vdpa_device, dev);
+	if (!vdev->mdev) {
+		NL_SET_ERR_MSG_MOD(info->extack, "unmanaged vdpa device");
+		err = -EINVAL;
+		goto mdev_err;
+	}
+	classes = vdpa_mgmtdev_get_classes(vdev->mdev, NULL);
+	if (classes & BIT_ULL(VIRTIO_ID_NET)) {
+		err = vdpa_dev_net_device_attr_set(vdev, info);
+	} else {
+		NL_SET_ERR_MSG_FMT_MOD(info->extack, "%s device not supported",
+				       name);
+	}
+
+mdev_err:
+	put_device(dev);
+dev_err:
+	up_write(&vdpa_dev_lock);
+	return err;
+}
+
 static int vdpa_dev_config_dump(struct device *dev, void *data)
 {
 	struct vdpa_device *vdev = container_of(dev, struct vdpa_device, dev);
@@ -1497,6 +1571,11 @@ static const struct genl_ops vdpa_nl_ops[] = {
 		.doit = vdpa_nl_cmd_dev_stats_get_doit,
 		.flags = GENL_ADMIN_PERM,
 	},
+	{
+		.cmd = VDPA_CMD_DEV_ATTR_SET,
+		.doit = vdpa_nl_cmd_dev_attr_set_doit,
+		.flags = GENL_ADMIN_PERM,
+	},
 };
 
 static struct genl_family vdpa_nl_family __ro_after_init = {
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 7977ca03ac7a..2e7a30fe6b92 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -582,11 +582,20 @@ void vdpa_set_status(struct vdpa_device *vdev, u8 status);
  * @dev: vdpa device to remove
  *	 Driver need to remove the specified device by calling
  *	 _vdpa_unregister_device().
+ * @dev_set_attr: change a vdpa device's attr after it was created
+ *		  @mdev: parent device to use for device
+ *		  @dev: vdpa device structure
+ *		  @config: Attributes to be set for the device.
+ *		  The driver needs to check the mask of the structure and then
+ *		  set the related information on the vdpa device. The driver
+ *		  must return 0 if set successfully.
  */
 struct vdpa_mgmtdev_ops {
 	int (*dev_add)(struct vdpa_mgmt_dev *mdev, const char *name,
 		       const struct vdpa_dev_set_config *config);
 	void (*dev_del)(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev);
+	int (*dev_set_attr)(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev,
+			    const struct vdpa_dev_set_config *config);
 };
 
 /**
diff --git a/include/uapi/linux/vdpa.h b/include/uapi/linux/vdpa.h
index 842bf1201ac4..71edf2c70cc3 100644
--- a/include/uapi/linux/vdpa.h
+++ b/include/uapi/linux/vdpa.h
@@ -19,6 +19,7 @@ enum vdpa_command {
 	VDPA_CMD_DEV_GET,		/* can dump */
 	VDPA_CMD_DEV_CONFIG_GET,	/* can dump */
 	VDPA_CMD_DEV_VSTATS_GET,
+	VDPA_CMD_DEV_ATTR_SET,
 };
 
 enum vdpa_attr {
--
cgit v1.2.3


From be8e9eb3750639aa5cffb3f764ca080caed41bd0 Mon Sep 17 00:00:00 2001
From: Jason Xing
Date: Mon, 9 Sep 2024 09:56:11 +0800
Subject: net-timestamp: introduce SOF_TIMESTAMPING_OPT_RX_FILTER flag

Introduce a new flag, SOF_TIMESTAMPING_OPT_RX_FILTER, in the receive path. Users can set it together with SOF_TIMESTAMPING_SOFTWARE to filter out rx software timestamp reports, especially after a process turns on netstamp_needed_key, which can timestamp every incoming skb.

Previously, we found that if one application starts first and turns on netstamp_needed_key, then another one passing only SOF_TIMESTAMPING_SOFTWARE could also get rx timestamps. Now we handle this case by introducing this new flag, without breaking existing users.

Quoting Willem to explain why we need the flag:

"why a process would want to request software timestamp reporting, but not receive software timestamp generation. The only use I see is when the application does request SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_TX_SOFTWARE."

Similarly, this new flag can also be used for the hardware case: when we set it with SOF_TIMESTAMPING_RAW_HARDWARE, we won't receive hardware receive timestamps.
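To make the intended usage concrete, here is a hedged sketch of a socket that wants software tx timestamp generation and reporting but no spurious rx software timestamps (flag names as in this patch):

    unsigned int flags = SOF_TIMESTAMPING_SOFTWARE |
                         SOF_TIMESTAMPING_TX_SOFTWARE |
                         SOF_TIMESTAMPING_OPT_RX_FILTER;

    if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags)) < 0)
        perror("setsockopt SO_TIMESTAMPING");

With SOF_TIMESTAMPING_OPT_RX_FILTER set, rx software timestamps that would only be reported because another socket enabled netstamp_needed_key are suppressed on this socket.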
A note about the errqueue handling in this patch: we need to handle the
egress path carefully, or else reporting the TX timestamp will fail.
Both the egress and ingress paths eventually call sock_recv_timestamp(),
so we have to distinguish between them; the errqueue is a good indicator
of the flow direction.

Suggested-by: Willem de Bruijn
Signed-off-by: Jason Xing
Reviewed-by: Willem de Bruijn
Link: https://patch.msgid.link/20240909015612.3856-2-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski
---
 Documentation/networking/timestamping.rst | 17 +++++++++++++++++
 include/uapi/linux/net_tstamp.h           |  3 ++-
 net/ethtool/common.c                      |  1 +
 net/ipv4/tcp.c                            |  9 +++++++--
 net/socket.c                              | 10 ++++++++--
 5 files changed, 35 insertions(+), 5 deletions(-)

(limited to 'include/uapi')

diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst
index 9c7773271393..8199e6917671 100644
--- a/Documentation/networking/timestamping.rst
+++ b/Documentation/networking/timestamping.rst
@@ -267,6 +267,23 @@ SOF_TIMESTAMPING_OPT_TX_SWHW:
   two separate messages will be looped to the socket's error queue,
   each containing just one timestamp.
 
+SOF_TIMESTAMPING_OPT_RX_FILTER:
+  Filter out spurious receive timestamps: report a receive timestamp
+  only if the matching timestamp generation flag is enabled.
+
+  Receive timestamps are generated early in the ingress path, before a
+  packet's destination socket is known. If any socket enables receive
+  timestamps, packets for all sockets will be timestamped, including
+  packets for sockets that request timestamp reporting with
+  SOF_TIMESTAMPING_SOFTWARE and/or SOF_TIMESTAMPING_RAW_HARDWARE but
+  do not request receive timestamp generation. This can happen when
+  requesting transmit timestamps only.
+
+  Receiving spurious timestamps is generally benign. A process can
+  ignore the unexpected non-zero value. But it makes behavior subtly
+  dependent on other sockets. This flag isolates the socket for more
+  deterministic behavior.
+
 New applications are encouraged to pass SOF_TIMESTAMPING_OPT_ID to
 disambiguate timestamps and SOF_TIMESTAMPING_OPT_TSONLY to operate
 regardless of the setting of sysctl net.core.tstamp_allow_data.
diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h index a2c66b3d7f0f..858339d1c1c4 100644 --- a/include/uapi/linux/net_tstamp.h +++ b/include/uapi/linux/net_tstamp.h @@ -32,8 +32,9 @@ enum { SOF_TIMESTAMPING_OPT_TX_SWHW = (1<<14), SOF_TIMESTAMPING_BIND_PHC = (1 << 15), SOF_TIMESTAMPING_OPT_ID_TCP = (1 << 16), + SOF_TIMESTAMPING_OPT_RX_FILTER = (1 << 17), - SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_ID_TCP, + SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_RX_FILTER, SOF_TIMESTAMPING_MASK = (SOF_TIMESTAMPING_LAST - 1) | SOF_TIMESTAMPING_LAST }; diff --git a/net/ethtool/common.c b/net/ethtool/common.c index 781834ef57c3..6c245e59bbc1 100644 --- a/net/ethtool/common.c +++ b/net/ethtool/common.c @@ -427,6 +427,7 @@ const char sof_timestamping_names[][ETH_GSTRING_LEN] = { [const_ilog2(SOF_TIMESTAMPING_OPT_TX_SWHW)] = "option-tx-swhw", [const_ilog2(SOF_TIMESTAMPING_BIND_PHC)] = "bind-phc", [const_ilog2(SOF_TIMESTAMPING_OPT_ID_TCP)] = "option-id-tcp", + [const_ilog2(SOF_TIMESTAMPING_OPT_RX_FILTER)] = "option-rx-filter", }; static_assert(ARRAY_SIZE(sof_timestamping_names) == __SOF_TIMESTAMPING_CNT); diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 8a5680b4e786..e359a9161445 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2235,6 +2235,7 @@ void tcp_recv_timestamp(struct msghdr *msg, const struct sock *sk, struct scm_timestamping_internal *tss) { int new_tstamp = sock_flag(sk, SOCK_TSTAMP_NEW); + u32 tsflags = READ_ONCE(sk->sk_tsflags); bool has_timestamping = false; if (tss->ts[0].tv_sec || tss->ts[0].tv_nsec) { @@ -2274,14 +2275,18 @@ void tcp_recv_timestamp(struct msghdr *msg, const struct sock *sk, } } - if (READ_ONCE(sk->sk_tsflags) & SOF_TIMESTAMPING_SOFTWARE) + if (tsflags & SOF_TIMESTAMPING_SOFTWARE && + (tsflags & SOF_TIMESTAMPING_RX_SOFTWARE || + !(tsflags & SOF_TIMESTAMPING_OPT_RX_FILTER))) has_timestamping = true; else tss->ts[0] = (struct timespec64) {0}; } if (tss->ts[2].tv_sec || tss->ts[2].tv_nsec) { - if (READ_ONCE(sk->sk_tsflags) & SOF_TIMESTAMPING_RAW_HARDWARE) + if (tsflags & SOF_TIMESTAMPING_RAW_HARDWARE && + (tsflags & SOF_TIMESTAMPING_RX_HARDWARE || + !(tsflags & SOF_TIMESTAMPING_OPT_RX_FILTER))) has_timestamping = true; else tss->ts[2] = (struct timespec64) {0}; diff --git a/net/socket.c b/net/socket.c index 0a2bd22ec105..8d8b84fa404a 100644 --- a/net/socket.c +++ b/net/socket.c @@ -946,11 +946,17 @@ void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk, memset(&tss, 0, sizeof(tss)); tsflags = READ_ONCE(sk->sk_tsflags); - if ((tsflags & SOF_TIMESTAMPING_SOFTWARE) && + if ((tsflags & SOF_TIMESTAMPING_SOFTWARE && + (tsflags & SOF_TIMESTAMPING_RX_SOFTWARE || + skb_is_err_queue(skb) || + !(tsflags & SOF_TIMESTAMPING_OPT_RX_FILTER))) && ktime_to_timespec64_cond(skb->tstamp, tss.ts + 0)) empty = 0; if (shhwtstamps && - (tsflags & SOF_TIMESTAMPING_RAW_HARDWARE) && + (tsflags & SOF_TIMESTAMPING_RAW_HARDWARE && + (tsflags & SOF_TIMESTAMPING_RX_HARDWARE || + skb_is_err_queue(skb) || + !(tsflags & SOF_TIMESTAMPING_OPT_RX_FILTER))) && !skb_is_swtx_tstamp(skb, false_tstamp)) { if_index = 0; if (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP_NETDEV) -- cgit v1.2.3 From 87f10faf166a9114aa0d4132298cad379de16fdd Mon Sep 17 00:00:00 2001 From: Bjorn Helgaas Date: Tue, 27 Aug 2024 18:48:48 -0500 Subject: PCI: Rename CRS Completion Status to RRS PCIe r6.0 changed the abbreviation for "Configuration Request Retry Status" Completion Status from "CRS" to "RRS" and uses the terminology of "Configuration RRS Software Visibility" instead of "CRS Software 
Visibility". Align the Linux usage with the r6.0 spec language. No functional change intended. It's confusing to make this change, but I think "RRS" *is* a better abbreviation because it was easy to interpret "CRS" as "Completion Retry Status", which really didn't make any sense. Link: https://lore.kernel.org/r/20240827234848.4429-4-helgaas@kernel.org Signed-off-by: Bjorn Helgaas --- drivers/bcma/driver_pci_host.c | 10 ++--- drivers/pci/controller/dwc/pcie-tegra194.c | 18 ++++----- drivers/pci/controller/pci-aardvark.c | 64 +++++++++++++++--------------- drivers/pci/controller/pci-xgene.c | 6 +-- drivers/pci/controller/pcie-iproc.c | 18 ++++----- drivers/pci/pci-bridge-emul.c | 4 +- drivers/pci/pci.c | 4 +- drivers/pci/pci.h | 8 ++-- drivers/pci/probe.c | 28 ++++++------- include/linux/bcma/bcma_driver_pci.h | 2 +- include/linux/pci.h | 2 +- include/uapi/linux/pci_regs.h | 6 ++- 12 files changed, 86 insertions(+), 84 deletions(-) (limited to 'include/uapi') diff --git a/drivers/bcma/driver_pci_host.c b/drivers/bcma/driver_pci_host.c index ed3be52ab63d..8540052d37c5 100644 --- a/drivers/bcma/driver_pci_host.c +++ b/drivers/bcma/driver_pci_host.c @@ -334,7 +334,7 @@ static u8 bcma_find_pci_capability(struct bcma_drv_pci *pc, unsigned int dev, } /* If the root port is capable of returning Config Request - * Retry Status (CRS) Completion Status to software then + * Retry Status (RRS) Completion Status to software then * enable the feature. */ static void bcma_core_pci_enable_crs(struct bcma_drv_pci *pc) @@ -348,10 +348,10 @@ static void bcma_core_pci_enable_crs(struct bcma_drv_pci *pc) NULL); root_cap = cap_ptr + PCI_EXP_RTCAP; bcma_extpci_read_config(pc, 0, 0, root_cap, &val16, sizeof(u16)); - if (val16 & BCMA_CORE_PCI_RC_CRS_VISIBILITY) { - /* Enable CRS software visibility */ + if (val16 & BCMA_CORE_PCI_RC_RRS_VISIBILITY) { + /* Enable Configuration RRS Software Visibility */ root_ctrl = cap_ptr + PCI_EXP_RTCTL; - val16 = PCI_EXP_RTCTL_CRSSVE; + val16 = PCI_EXP_RTCTL_RRS_SVE; bcma_extpci_read_config(pc, 0, 0, root_ctrl, &val16, sizeof(u16)); @@ -360,7 +360,7 @@ static void bcma_core_pci_enable_crs(struct bcma_drv_pci *pc) * 100 ms wait time from the end of Reset. If the device is * not done with its internal initialization, it must at * least return a completion TLP, with a completion status - * of "Configuration Request Retry Status (CRS)". The root + * of "Configuration Request Retry Status (RRS)". The root * complex must complete the request to the host by returning * a read-data value of 0001h for the Vendor ID field and * all 1s for any additional bytes included in the request. 
diff --git a/drivers/pci/controller/dwc/pcie-tegra194.c b/drivers/pci/controller/dwc/pcie-tegra194.c index 4bf7b433417a..886354342ef1 100644 --- a/drivers/pci/controller/dwc/pcie-tegra194.c +++ b/drivers/pci/controller/dwc/pcie-tegra194.c @@ -183,11 +183,11 @@ #define GEN3_EQ_CONTROL_OFF_FB_MODE_MASK GENMASK(3, 0) #define PORT_LOGIC_AMBA_ERROR_RESPONSE_DEFAULT 0x8D0 -#define AMBA_ERROR_RESPONSE_CRS_SHIFT 3 -#define AMBA_ERROR_RESPONSE_CRS_MASK GENMASK(1, 0) -#define AMBA_ERROR_RESPONSE_CRS_OKAY 0 -#define AMBA_ERROR_RESPONSE_CRS_OKAY_FFFFFFFF 1 -#define AMBA_ERROR_RESPONSE_CRS_OKAY_FFFF0001 2 +#define AMBA_ERROR_RESPONSE_RRS_SHIFT 3 +#define AMBA_ERROR_RESPONSE_RRS_MASK GENMASK(1, 0) +#define AMBA_ERROR_RESPONSE_RRS_OKAY 0 +#define AMBA_ERROR_RESPONSE_RRS_OKAY_FFFFFFFF 1 +#define AMBA_ERROR_RESPONSE_RRS_OKAY_FFFF0001 2 #define MSIX_ADDR_MATCH_LOW_OFF 0x940 #define MSIX_ADDR_MATCH_LOW_OFF_EN BIT(0) @@ -907,11 +907,11 @@ static int tegra_pcie_dw_host_init(struct dw_pcie_rp *pp) dw_pcie_writel_dbi(pci, PCI_BASE_ADDRESS_0, 0); - /* Enable as 0xFFFF0001 response for CRS */ + /* Enable as 0xFFFF0001 response for RRS */ val = dw_pcie_readl_dbi(pci, PORT_LOGIC_AMBA_ERROR_RESPONSE_DEFAULT); - val &= ~(AMBA_ERROR_RESPONSE_CRS_MASK << AMBA_ERROR_RESPONSE_CRS_SHIFT); - val |= (AMBA_ERROR_RESPONSE_CRS_OKAY_FFFF0001 << - AMBA_ERROR_RESPONSE_CRS_SHIFT); + val &= ~(AMBA_ERROR_RESPONSE_RRS_MASK << AMBA_ERROR_RESPONSE_RRS_SHIFT); + val |= (AMBA_ERROR_RESPONSE_RRS_OKAY_FFFF0001 << + AMBA_ERROR_RESPONSE_RRS_SHIFT); dw_pcie_writel_dbi(pci, PORT_LOGIC_AMBA_ERROR_RESPONSE_DEFAULT, val); /* Clear Slot Clock Configuration bit if SRNS configuration */ diff --git a/drivers/pci/controller/pci-aardvark.c b/drivers/pci/controller/pci-aardvark.c index e66594558ce2..afccae3bfbc6 100644 --- a/drivers/pci/controller/pci-aardvark.c +++ b/drivers/pci/controller/pci-aardvark.c @@ -50,7 +50,7 @@ #define PIO_COMPLETION_STATUS_MASK GENMASK(9, 7) #define PIO_COMPLETION_STATUS_OK 0 #define PIO_COMPLETION_STATUS_UR 1 -#define PIO_COMPLETION_STATUS_CRS 2 +#define PIO_COMPLETION_STATUS_RRS 2 #define PIO_COMPLETION_STATUS_CA 4 #define PIO_NON_POSTED_REQ BIT(10) #define PIO_ERR_STATUS BIT(11) @@ -262,7 +262,7 @@ enum { #define MSI_IRQ_NUM 32 -#define CFG_RD_CRS_VAL 0xffff0001 +#define CFG_RD_RRS_VAL 0xffff0001 struct advk_pcie { struct platform_device *pdev; @@ -649,7 +649,7 @@ static void advk_pcie_setup_hw(struct advk_pcie *pcie) advk_pcie_train_link(pcie); } -static int advk_pcie_check_pio_status(struct advk_pcie *pcie, bool allow_crs, u32 *val) +static int advk_pcie_check_pio_status(struct advk_pcie *pcie, bool allow_rrs, u32 *val) { struct device *dev = &pcie->pdev->dev; u32 reg; @@ -669,7 +669,7 @@ static int advk_pcie_check_pio_status(struct advk_pcie *pcie, bool allow_crs, u3 * 2) value Unsupported Request(1) of COMPLETION_STATUS(bit9:7) only * means a PIO write error, and for PIO read it is successful with * a read value of 0xFFFFFFFF. - * 3) value Completion Retry Status(CRS) of COMPLETION_STATUS(bit9:7) + * 3) value Config Request Retry Status(RRS) of COMPLETION_STATUS(bit9:7) * only means a PIO write error, and for PIO read it is successful * with a read value of 0xFFFF0001. 
* 4) value Completer Abort (CA) of COMPLETION_STATUS(bit9:7) means @@ -694,10 +694,10 @@ static int advk_pcie_check_pio_status(struct advk_pcie *pcie, bool allow_crs, u3 strcomp_status = "UR"; ret = -EOPNOTSUPP; break; - case PIO_COMPLETION_STATUS_CRS: - if (allow_crs && val) { - /* PCIe r4.0, sec 2.3.2, says: - * If CRS Software Visibility is enabled: + case PIO_COMPLETION_STATUS_RRS: + if (allow_rrs && val) { + /* PCIe r6.0, sec 2.3.2, says: + * If Configuration RRS Software Visibility is enabled: * For a Configuration Read Request that includes both * bytes of the Vendor ID field of a device Function's * Configuration Space Header, the Root Complex must @@ -706,22 +706,22 @@ static int advk_pcie_check_pio_status(struct advk_pcie *pcie, bool allow_crs, u3 * all '1's for any additional bytes included in the * request. * - * So CRS in this case is not an error status. + * So RRS in this case is not an error status. */ - *val = CFG_RD_CRS_VAL; + *val = CFG_RD_RRS_VAL; strcomp_status = NULL; ret = 0; break; } - /* PCIe r4.0, sec 2.3.2, says: - * If CRS Software Visibility is not enabled, the Root Complex + /* PCIe r6.0, sec 2.3.2, says: + * If RRS Software Visibility is not enabled, the Root Complex * must re-issue the Configuration Request as a new Request. - * If CRS Software Visibility is enabled: For a Configuration + * If RRS Software Visibility is enabled: For a Configuration * Write Request or for any other Configuration Read Request, * the Root Complex must re-issue the Configuration Request as * a new Request. * A Root Complex implementation may choose to limit the number - * of Configuration Request/CRS Completion Status loops before + * of Configuration Request/RRS Completion Status loops before * determining that something is wrong with the target of the * Request and taking appropriate action, e.g., complete the * Request to the host as a failed transaction. @@ -729,7 +729,7 @@ static int advk_pcie_check_pio_status(struct advk_pcie *pcie, bool allow_crs, u3 * So return -EAGAIN and caller (pci-aardvark.c driver) will * re-issue request again up to the PIO_RETRY_CNT retries. 
*/ - strcomp_status = "CRS"; + strcomp_status = "RRS"; ret = -EAGAIN; break; case PIO_COMPLETION_STATUS_CA: @@ -920,8 +920,8 @@ advk_pci_bridge_emul_pcie_conf_write(struct pci_bridge_emul *bridge, case PCI_EXP_RTCTL: { u16 rootctl = le16_to_cpu(bridge->pcie_conf.rootctl); - /* Only emulation of PMEIE and CRSSVE bits is provided */ - rootctl &= PCI_EXP_RTCTL_PMEIE | PCI_EXP_RTCTL_CRSSVE; + /* Only emulation of PMEIE and RRS_SVE bits is provided */ + rootctl &= PCI_EXP_RTCTL_PMEIE | PCI_EXP_RTCTL_RRS_SVE; bridge->pcie_conf.rootctl = cpu_to_le16(rootctl); break; } @@ -1075,7 +1075,7 @@ static int advk_sw_pci_bridge_init(struct advk_pcie *pcie) bridge->pcie_conf.slotsta = cpu_to_le16(PCI_EXP_SLTSTA_PDS); /* Indicates supports for Completion Retry Status */ - bridge->pcie_conf.rootcap = cpu_to_le16(PCI_EXP_RTCAP_CRSVIS); + bridge->pcie_conf.rootcap = cpu_to_le16(PCI_EXP_RTCAP_RRS_SV); bridge->subsystem_vendor_id = advk_readl(pcie, PCIE_CORE_SSDEV_ID_REG) & 0xffff; bridge->subsystem_id = advk_readl(pcie, PCIE_CORE_SSDEV_ID_REG) >> 16; @@ -1141,7 +1141,7 @@ static int advk_pcie_rd_conf(struct pci_bus *bus, u32 devfn, { struct advk_pcie *pcie = bus->sysdata; int retry_count; - bool allow_crs; + bool allow_rrs; u32 reg; int ret; @@ -1153,16 +1153,16 @@ static int advk_pcie_rd_conf(struct pci_bus *bus, u32 devfn, size, val); /* - * Completion Retry Status is possible to return only when reading - * both bytes from PCI_VENDOR_ID at once and CRSSVE flag on Root - * Port is enabled. + * Configuration Request Retry Status (RRS) is possible to return + * only when reading both bytes from PCI_VENDOR_ID at once and + * RRS_SVE flag on Root Port is enabled. */ - allow_crs = (where == PCI_VENDOR_ID) && (size >= 2) && + allow_rrs = (where == PCI_VENDOR_ID) && (size >= 2) && (le16_to_cpu(pcie->bridge.pcie_conf.rootctl) & - PCI_EXP_RTCTL_CRSSVE); + PCI_EXP_RTCTL_RRS_SVE); if (advk_pcie_pio_is_running(pcie)) - goto try_crs; + goto try_rrs; /* Program the control register */ reg = advk_readl(pcie, PIO_CTRL); @@ -1189,12 +1189,12 @@ static int advk_pcie_rd_conf(struct pci_bus *bus, u32 devfn, ret = advk_pcie_wait_pio(pcie); if (ret < 0) - goto try_crs; + goto try_rrs; retry_count += ret; /* Check PIO status and get the read result */ - ret = advk_pcie_check_pio_status(pcie, allow_crs, val); + ret = advk_pcie_check_pio_status(pcie, allow_rrs, val); } while (ret == -EAGAIN && retry_count < PIO_RETRY_CNT); if (ret < 0) @@ -1207,13 +1207,13 @@ static int advk_pcie_rd_conf(struct pci_bus *bus, u32 devfn, return PCIBIOS_SUCCESSFUL; -try_crs: +try_rrs: /* - * If it is possible, return Completion Retry Status so that caller - * tries to issue the request again instead of failing. + * If it is possible, return Configuration Request Retry Status so + * that caller tries to issue the request again instead of failing. 
*/ - if (allow_crs) { - *val = CFG_RD_CRS_VAL; + if (allow_rrs) { + *val = CFG_RD_RRS_VAL; return PCIBIOS_SUCCESSFUL; } diff --git a/drivers/pci/controller/pci-xgene.c b/drivers/pci/controller/pci-xgene.c index 8e457fa450a2..1e2ebbfa36d1 100644 --- a/drivers/pci/controller/pci-xgene.c +++ b/drivers/pci/controller/pci-xgene.c @@ -171,17 +171,17 @@ static int xgene_pcie_config_read32(struct pci_bus *bus, unsigned int devfn, /* * The v1 controller has a bug in its Configuration Request Retry - * Status (CRS) logic: when CRS Software Visibility is enabled and + * Status (RRS) logic: when RRS Software Visibility is enabled and * we read the Vendor and Device ID of a non-existent device, the * controller fabricates return data of 0xFFFF0001 ("device exists * but is not ready") instead of 0xFFFFFFFF (PCI_ERROR_RESPONSE) * ("device does not exist"). This causes the PCI core to retry * the read until it times out. Avoid this by not claiming to - * support CRS SV. + * support RRS SV. */ if (pci_is_root_bus(bus) && (port->version == XGENE_PCIE_IP_VER_1) && ((where & ~0x3) == XGENE_V1_PCI_EXP_CAP + PCI_EXP_RTCTL)) - *val &= ~(PCI_EXP_RTCAP_CRSVIS << 16); + *val &= ~(PCI_EXP_RTCAP_RRS_SV << 16); if (size <= 2) *val = (*val >> (8 * (where & 3))) & ((1 << (size * 8)) - 1); diff --git a/drivers/pci/controller/pcie-iproc.c b/drivers/pci/controller/pcie-iproc.c index 97f739a2c9f8..22134e95574b 100644 --- a/drivers/pci/controller/pcie-iproc.c +++ b/drivers/pci/controller/pcie-iproc.c @@ -54,7 +54,7 @@ #define CFG_RD_SUCCESS 0 #define CFG_RD_UR 1 -#define CFG_RD_CRS 2 +#define CFG_RD_RRS 2 #define CFG_RD_CA 3 #define CFG_RETRY_STATUS 0xffff0001 #define CFG_RETRY_STATUS_TIMEOUT_US 500000 /* 500 milliseconds */ @@ -485,31 +485,31 @@ static unsigned int iproc_pcie_cfg_retry(struct iproc_pcie *pcie, u32 status; /* - * As per PCIe spec r3.1, sec 2.3.2, CRS Software Visibility only + * As per PCIe r6.0, sec 2.3.2, Config RRS Software Visibility only * affects config reads of the Vendor ID. For config writes or any * other config reads, the Root may automatically reissue the * configuration request again as a new request. * * For config reads, this hardware returns CFG_RETRY_STATUS data - * when it receives a CRS completion, regardless of the address of - * the read or the CRS Software Visibility Enable bit. As a + * when it receives a RRS completion, regardless of the address of + * the read or the RRS Software Visibility Enable bit. As a * partial workaround for this, we retry in software any read that * returns CFG_RETRY_STATUS. * * Note that a non-Vendor ID config register may have a value of * CFG_RETRY_STATUS. If we read that, we can't distinguish it from - * a CRS completion, so we will incorrectly retry the read and + * a RRS completion, so we will incorrectly retry the read and * eventually return the wrong data (0xffffffff). */ data = readl(cfg_data_p); while (data == CFG_RETRY_STATUS && timeout--) { /* - * CRS state is set in CFG_RD status register + * RRS state is set in CFG_RD status register * This will handle the case where CFG_RETRY_STATUS is * valid config data. 
*/ status = iproc_pcie_read_reg(pcie, IPROC_PCIE_CFG_RD_STATUS); - if (status != CFG_RD_CRS) + if (status != CFG_RD_RRS) return data; udelay(1); @@ -556,8 +556,8 @@ static void iproc_pcie_fix_cap(struct iproc_pcie *pcie, int where, u32 *val) break; case IPROC_PCI_EXP_CAP + PCI_EXP_RTCTL: - /* Don't advertise CRS SV support */ - *val &= ~(PCI_EXP_RTCAP_CRSVIS << 16); + /* Don't advertise RRS SV support */ + *val &= ~(PCI_EXP_RTCAP_RRS_SV << 16); break; default: diff --git a/drivers/pci/pci-bridge-emul.c b/drivers/pci/pci-bridge-emul.c index 9334b2dd4764..6658c1edd464 100644 --- a/drivers/pci/pci-bridge-emul.c +++ b/drivers/pci/pci-bridge-emul.c @@ -257,8 +257,8 @@ struct pci_bridge_reg_behavior pcie_cap_regs_behavior[PCI_CAP_PCIE_SIZEOF / 4] = */ .rw = (PCI_EXP_RTCTL_SECEE | PCI_EXP_RTCTL_SENFEE | PCI_EXP_RTCTL_SEFEE | PCI_EXP_RTCTL_PMEIE | - PCI_EXP_RTCTL_CRSSVE), - .ro = PCI_EXP_RTCAP_CRSVIS << 16, + PCI_EXP_RTCTL_RRS_SVE), + .ro = PCI_EXP_RTCAP_RRS_SV << 16, }, [PCI_EXP_RTSTA / 4] = { diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index fc2ecb7fe081..55d6124acb2d 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1320,9 +1320,9 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout) return -ENOTTY; } - if (root && root->config_crs_sv) { + if (root && root->config_rrs_sv) { pci_read_config_dword(dev, PCI_VENDOR_ID, &id); - if (!pci_bus_crs_vendor_id(id)) + if (!pci_bus_rrs_vendor_id(id)) break; } else { pci_read_config_dword(dev, PCI_COMMAND, &id); diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index fa1997bc2667..061cc1b48fd8 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -139,7 +139,7 @@ bool pci_bridge_d3_possible(struct pci_dev *dev); void pci_bridge_d3_update(struct pci_dev *dev); int pci_bridge_wait_for_secondary_bus(struct pci_dev *dev, char *reset_type); -static inline bool pci_bus_crs_vendor_id(u32 l) +static inline bool pci_bus_rrs_vendor_id(u32 l) { return (l & 0xffff) == PCI_VENDOR_ID_PCI_SIG; } @@ -295,10 +295,10 @@ void pci_put_host_bridge_device(struct device *dev); int pci_configure_extended_tags(struct pci_dev *dev, void *ign); bool pci_bus_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *pl, - int crs_timeout); + int rrs_timeout); bool pci_bus_generic_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *pl, - int crs_timeout); -int pci_idt_bus_quirk(struct pci_bus *bus, int devfn, u32 *pl, int crs_timeout); + int rrs_timeout); +int pci_idt_bus_quirk(struct pci_bus *bus, int devfn, u32 *pl, int rrs_timeout); int pci_setup_device(struct pci_dev *dev); int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type, diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index b1615da9eb6b..e79d9bea32bf 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -1203,16 +1203,16 @@ struct pci_bus *pci_add_new_bus(struct pci_bus *parent, struct pci_dev *dev, } EXPORT_SYMBOL(pci_add_new_bus); -static void pci_enable_crs(struct pci_dev *pdev) +static void pci_enable_rrs_sv(struct pci_dev *pdev) { u16 root_cap = 0; - /* Enable CRS Software Visibility if supported */ + /* Enable Configuration RRS Software Visibility if supported */ pcie_capability_read_word(pdev, PCI_EXP_RTCAP, &root_cap); - if (root_cap & PCI_EXP_RTCAP_CRSVIS) { + if (root_cap & PCI_EXP_RTCAP_RRS_SV) { pcie_capability_set_word(pdev, PCI_EXP_RTCTL, - PCI_EXP_RTCTL_CRSSVE); - pdev->config_crs_sv = 1; + PCI_EXP_RTCTL_RRS_SVE); + pdev->config_rrs_sv = 1; } } @@ -1328,7 +1328,7 @@ static int pci_scan_bridge_extend(struct pci_bus *bus, struct pci_dev *dev, 
pci_write_config_word(dev, PCI_BRIDGE_CONTROL, bctl & ~PCI_BRIDGE_CTL_MASTER_ABORT); - pci_enable_crs(dev); + pci_enable_rrs_sv(dev); if ((secondary || subordinate) && !pcibios_assign_all_busses() && !is_cardbus && !broken) { @@ -2345,23 +2345,23 @@ struct pci_dev *pci_alloc_dev(struct pci_bus *bus) } EXPORT_SYMBOL(pci_alloc_dev); -static bool pci_bus_wait_crs(struct pci_bus *bus, int devfn, u32 *l, +static bool pci_bus_wait_rrs(struct pci_bus *bus, int devfn, u32 *l, int timeout) { int delay = 1; - if (!pci_bus_crs_vendor_id(*l)) - return true; /* not a CRS completion */ + if (!pci_bus_rrs_vendor_id(*l)) + return true; /* not a Configuration RRS completion */ if (!timeout) - return false; /* CRS, but caller doesn't want to wait */ + return false; /* RRS, but caller doesn't want to wait */ /* * We got the reserved Vendor ID that indicates a completion with - * Configuration Request Retry Status (CRS). Retry until we get a + * Configuration Request Retry Status (RRS). Retry until we get a * valid Vendor ID or we time out. */ - while (pci_bus_crs_vendor_id(*l)) { + while (pci_bus_rrs_vendor_id(*l)) { if (delay > timeout) { pr_warn("pci %04x:%02x:%02x.%d: not ready after %dms; giving up\n", pci_domain_nr(bus), bus->number, @@ -2400,8 +2400,8 @@ bool pci_bus_generic_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *l, *l == 0x0000ffff || *l == 0xffff0000) return false; - if (pci_bus_crs_vendor_id(*l)) - return pci_bus_wait_crs(bus, devfn, l, timeout); + if (pci_bus_rrs_vendor_id(*l)) + return pci_bus_wait_rrs(bus, devfn, l, timeout); return true; } diff --git a/include/linux/bcma/bcma_driver_pci.h b/include/linux/bcma/bcma_driver_pci.h index 68da8dba5162..dba41b65ae0d 100644 --- a/include/linux/bcma/bcma_driver_pci.h +++ b/include/linux/bcma/bcma_driver_pci.h @@ -203,7 +203,7 @@ struct pci_dev; #define BCMA_CORE_PCI_MDIO_RXCTRL0 0x840 /* PCIE Root Capability Register bits (Host mode only) */ -#define BCMA_CORE_PCI_RC_CRS_VISIBILITY 0x0001 +#define BCMA_CORE_PCI_RC_RRS_VISIBILITY 0x0001 struct bcma_drv_pci; struct bcma_bus; diff --git a/include/linux/pci.h b/include/linux/pci.h index 121d8d94d6d0..27d8ce977674 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -371,7 +371,7 @@ struct pci_dev { can be generated */ unsigned int pme_poll:1; /* Poll device's PME status bit */ unsigned int pinned:1; /* Whether this dev is pinned */ - unsigned int config_crs_sv:1; /* Config CRS software visibility */ + unsigned int config_rrs_sv:1; /* Config RRS software visibility */ unsigned int imm_ready:1; /* Supports Immediate Readiness */ unsigned int d1_support:1; /* Low power state D1 is supported */ unsigned int d2_support:1; /* Low power state D2 is supported */ diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h index 94c00996e633..f94591f9f5e9 100644 --- a/include/uapi/linux/pci_regs.h +++ b/include/uapi/linux/pci_regs.h @@ -634,9 +634,11 @@ #define PCI_EXP_RTCTL_SENFEE 0x0002 /* System Error on Non-Fatal Error */ #define PCI_EXP_RTCTL_SEFEE 0x0004 /* System Error on Fatal Error */ #define PCI_EXP_RTCTL_PMEIE 0x0008 /* PME Interrupt Enable */ -#define PCI_EXP_RTCTL_CRSSVE 0x0010 /* CRS Software Visibility Enable */ +#define PCI_EXP_RTCTL_RRS_SVE 0x0010 /* Config RRS Software Visibility Enable */ +#define PCI_EXP_RTCTL_CRSSVE PCI_EXP_RTCTL_RRS_SVE /* compatibility */ #define PCI_EXP_RTCAP 0x1e /* Root Capabilities */ -#define PCI_EXP_RTCAP_CRSVIS 0x0001 /* CRS Software Visibility capability */ +#define PCI_EXP_RTCAP_RRS_SV 0x0001 /* Config RRS Software Visibility */ 
+#define PCI_EXP_RTCAP_CRSVIS	PCI_EXP_RTCAP_RRS_SV	/* compatibility */
 #define PCI_EXP_RTSTA		0x20	/* Root Status */
 #define PCI_EXP_RTSTA_PME_RQ_ID	0x0000ffff /* PME Requester ID */
 #define PCI_EXP_RTSTA_PME	0x00010000 /* PME status */
--
cgit v1.2.3

From 6ebf2d021a13a77b495b4de15c6834e26b80d08e Mon Sep 17 00:00:00 2001
From: Christian Loehle
Date: Tue, 13 Aug 2024 15:43:46 +0100
Subject: sched/deadline: Clarify nanoseconds in uapi

Explicitly specify that the time values of the deadline parameters
deadline, runtime, and period are in nanoseconds, as they always have
been.

Signed-off-by: Christian Loehle
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Juri Lelli
Acked-by: Rafael J. Wysocki
Link: https://lore.kernel.org/r/20240813144348.1180344-3-christian.loehle@arm.com
---
 include/uapi/linux/sched/types.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

(limited to 'include/uapi')

diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index 90662385689b..bf6e9ae031c1 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -58,9 +58,9 @@
  *
  * This is reflected by the following fields of the sched_attr structure:
  *
- *  @sched_deadline	representative of the task's deadline
- *  @sched_runtime	representative of the task's runtime
- *  @sched_period	representative of the task's period
+ *  @sched_deadline	representative of the task's deadline in nanoseconds
+ *  @sched_runtime	representative of the task's runtime in nanoseconds
+ *  @sched_period	representative of the task's period in nanoseconds
  *
  * Given this task model, there are a multiplicity of scheduling algorithms
  * and policies, that can be used to ensure all the tasks will make their
--
cgit v1.2.3

From 50c52250e2d74b098465841163c18f4b4e9ad430 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov
Date: Wed, 11 Sep 2024 17:34:41 +0100
Subject: block: implement async io_uring discard cmd

io_uring allows implementing custom, file-specific asynchronous
operations via the fops->uring_cmd callback, a.k.a. IORING_OP_URING_CMD
requests or just io_uring commands. Use it to add support for async
discards.

Normally, it first tries to queue up bios in a non-blocking context,
and if that fails, we retry from a blocking context by returning
-EAGAIN to the core io_uring. We always get the result from bios
asynchronously by setting a custom bi_end_io callback, at which point
we drag the request into the task context to either reissue or complete
it and post a completion to the user.

Unlike ioctl(BLKDISCARD), which has stronger guarantees against races,
we only make a best-effort attempt to invalidate the page cache, and it
can race with any writes and reads and leave the page cache stale.
These are the same kinds of races we allow for direct writes.

Also, apart from cases where discarding is not allowed at all, e.g.
discards are not supported or the file/device is read-only, the user
should assume that the sector range on disk is no longer valid, even
when an error was returned to the user.
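
As a rough userspace sketch (not part of this patch, and assuming
liburing is available), issuing the new command could look like the
following; start and len are byte values that must satisfy the usual
discard alignment rules:

    #include <liburing.h>
    #include <linux/blkdev.h>

    /* Sketch only: submit one async discard and wait for its CQE. */
    static int async_discard(struct io_uring *ring, int bdev_fd,
                             __u64 start, __u64 len)
    {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
            struct io_uring_cqe *cqe;
            int ret;

            io_uring_prep_rw(IORING_OP_URING_CMD, sqe, bdev_fd, NULL, 0, 0);
            sqe->cmd_op = BLOCK_URING_CMD_DISCARD;
            sqe->addr = start;      /* byte offset into the device */
            sqe->addr3 = len;       /* byte length of the range */

            io_uring_submit(ring);
            ret = io_uring_wait_cqe(ring, &cqe);
            if (ret)
                    return ret;
            ret = cqe->res;         /* 0 on success, -errno on failure */
            io_uring_cqe_seen(ring, cqe);
            return ret;
    }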
Suggested-by: Conrad Meyer Signed-off-by: Pavel Begunkov Link: https://lore.kernel.org/r/2b5210443e4fa0257934f73dfafcc18a77cd0e09.1726072086.git.asml.silence@gmail.com Signed-off-by: Jens Axboe --- block/blk.h | 1 + block/fops.c | 2 + block/ioctl.c | 112 ++++++++++++++++++++++++++++++++++++++++++++ include/uapi/linux/blkdev.h | 14 ++++++ 4 files changed, 129 insertions(+) create mode 100644 include/uapi/linux/blkdev.h (limited to 'include/uapi') diff --git a/block/blk.h b/block/blk.h index 86affb583eb6..c718e4291db0 100644 --- a/block/blk.h +++ b/block/blk.h @@ -609,6 +609,7 @@ blk_mode_t file_to_blk_mode(struct file *file); int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode, loff_t lstart, loff_t lend); long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags); long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); extern const struct address_space_operations def_blk_aops; diff --git a/block/fops.c b/block/fops.c index 9825c1713a49..8154b10b5abf 100644 --- a/block/fops.c +++ b/block/fops.c @@ -17,6 +17,7 @@ #include #include #include +#include #include "blk.h" static inline struct inode *bdev_file_inode(struct file *file) @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = { .splice_read = filemap_splice_read, .splice_write = iter_file_splice_write, .fallocate = blkdev_fallocate, + .uring_cmd = blkdev_uring_cmd, .fop_flags = FOP_BUFFER_RASYNC, }; diff --git a/block/ioctl.c b/block/ioctl.c index 6eaa3f20bf34..6554b728bae6 100644 --- a/block/ioctl.c +++ b/block/ioctl.c @@ -11,6 +11,9 @@ #include #include #include +#include +#include +#include #include "blk.h" static int blkpg_do_ioctl(struct block_device *bdev, @@ -748,3 +751,112 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg) return ret; } #endif + +struct blk_iou_cmd { + int res; + bool nowait; +}; + +static void blk_cmd_complete(struct io_uring_cmd *cmd, unsigned int issue_flags) +{ + struct blk_iou_cmd *bic = io_uring_cmd_to_pdu(cmd, struct blk_iou_cmd); + + if (bic->res == -EAGAIN && bic->nowait) + io_uring_cmd_issue_blocking(cmd); + else + io_uring_cmd_done(cmd, bic->res, 0, issue_flags); +} + +static void bio_cmd_bio_end_io(struct bio *bio) +{ + struct io_uring_cmd *cmd = bio->bi_private; + struct blk_iou_cmd *bic = io_uring_cmd_to_pdu(cmd, struct blk_iou_cmd); + + if (unlikely(bio->bi_status) && !bic->res) + bic->res = blk_status_to_errno(bio->bi_status); + + io_uring_cmd_do_in_task_lazy(cmd, blk_cmd_complete); + bio_put(bio); +} + +static int blkdev_cmd_discard(struct io_uring_cmd *cmd, + struct block_device *bdev, + uint64_t start, uint64_t len, bool nowait) +{ + struct blk_iou_cmd *bic = io_uring_cmd_to_pdu(cmd, struct blk_iou_cmd); + gfp_t gfp = nowait ? 
GFP_NOWAIT : GFP_KERNEL; + sector_t sector = start >> SECTOR_SHIFT; + sector_t nr_sects = len >> SECTOR_SHIFT; + struct bio *prev = NULL, *bio; + int err; + + if (!bdev_max_discard_sectors(bdev)) + return -EOPNOTSUPP; + if (!(file_to_blk_mode(cmd->file) & BLK_OPEN_WRITE)) + return -EBADF; + if (bdev_read_only(bdev)) + return -EPERM; + err = blk_validate_byte_range(bdev, start, len); + if (err) + return err; + + err = filemap_invalidate_pages(bdev->bd_mapping, start, + start + len - 1, nowait); + if (err) + return err; + + while (true) { + bio = blk_alloc_discard_bio(bdev, §or, &nr_sects, gfp); + if (!bio) + break; + if (nowait) { + /* + * Don't allow multi-bio non-blocking submissions as + * subsequent bios may fail but we won't get a direct + * indication of that. Normally, the caller should + * retry from a blocking context. + */ + if (unlikely(nr_sects)) { + bio_put(bio); + return -EAGAIN; + } + bio->bi_opf |= REQ_NOWAIT; + } + + prev = bio_chain_and_submit(prev, bio); + } + if (unlikely(!prev)) + return -EAGAIN; + if (unlikely(nr_sects)) + bic->res = -EAGAIN; + + prev->bi_private = cmd; + prev->bi_end_io = bio_cmd_bio_end_io; + submit_bio(prev); + return -EIOCBQUEUED; +} + +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags) +{ + struct block_device *bdev = I_BDEV(cmd->file->f_mapping->host); + struct blk_iou_cmd *bic = io_uring_cmd_to_pdu(cmd, struct blk_iou_cmd); + const struct io_uring_sqe *sqe = cmd->sqe; + u32 cmd_op = cmd->cmd_op; + uint64_t start, len; + + if (unlikely(sqe->ioprio || sqe->__pad1 || sqe->len || + sqe->rw_flags || sqe->file_index)) + return -EINVAL; + + bic->res = 0; + bic->nowait = issue_flags & IO_URING_F_NONBLOCK; + + start = READ_ONCE(sqe->addr); + len = READ_ONCE(sqe->addr3); + + switch (cmd_op) { + case BLOCK_URING_CMD_DISCARD: + return blkdev_cmd_discard(cmd, bdev, start, len, bic->nowait); + } + return -EINVAL; +} diff --git a/include/uapi/linux/blkdev.h b/include/uapi/linux/blkdev.h new file mode 100644 index 000000000000..66373cd1a83a --- /dev/null +++ b/include/uapi/linux/blkdev.h @@ -0,0 +1,14 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _UAPI_LINUX_BLKDEV_H +#define _UAPI_LINUX_BLKDEV_H + +#include +#include + +/* + * io_uring block file commands, see IORING_OP_URING_CMD. + * It's a different number space from ioctl(), reuse the block's code 0x12. + */ +#define BLOCK_URING_CMD_DISCARD _IO(0x12, 0) + +#endif -- cgit v1.2.3 From 3efd7ab46d0aebc3e567a9846e79a98dbad3291c Mon Sep 17 00:00:00 2001 From: Mina Almasry Date: Tue, 10 Sep 2024 17:14:46 +0000 Subject: net: netdev netlink api to bind dma-buf to a net device API takes the dma-buf fd as input, and binds it to the netdevice. The user can specify the rx queues to bind the dma-buf to. 
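
As a rough illustration of the resulting uAPI (not part of this patch;
the ifindex, fd, and queue ids below are made up), binding a dmabuf to
two RX queues with the ynl CLI could look like:

    $ ./cli.py --spec Documentation/netlink/specs/netdev.yaml \
          --do bind-rx \
          --json '{"ifindex": 3, "fd": 8,
                   "queues": [{"type": "rx", "id": 14},
                              {"type": "rx", "id": 15}]}'
    {'id': 1}

The reply carries the id of the new binding, matching the dmabuf
attribute set defined in the spec above.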
Suggested-by: Stanislav Fomichev Signed-off-by: Mina Almasry Reviewed-by: Donald Hunter Reviewed-by: Jakub Kicinski Link: https://patch.msgid.link/20240910171458.219195-3-almasrymina@google.com Signed-off-by: Jakub Kicinski --- Documentation/netlink/specs/netdev.yaml | 47 +++++++++++++++++++++++++++++++++ include/uapi/linux/netdev.h | 11 ++++++++ net/core/netdev-genl-gen.c | 19 +++++++++++++ net/core/netdev-genl-gen.h | 2 ++ net/core/netdev-genl.c | 6 +++++ tools/include/uapi/linux/netdev.h | 11 ++++++++ 6 files changed, 96 insertions(+) (limited to 'include/uapi') diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml index 959755be4d7f..4930e8142aa6 100644 --- a/Documentation/netlink/specs/netdev.yaml +++ b/Documentation/netlink/specs/netdev.yaml @@ -457,6 +457,39 @@ attribute-sets: Number of times driver re-started accepting send requests to this queue from the stack. type: uint + - + name: queue-id + subset-of: queue + attributes: + - + name: id + - + name: type + - + name: dmabuf + attributes: + - + name: ifindex + doc: netdev ifindex to bind the dmabuf to. + type: u32 + checks: + min: 1 + - + name: queues + doc: receive queues to bind the dmabuf to. + type: nest + nested-attributes: queue-id + multi-attr: true + - + name: fd + doc: dmabuf file descriptor to bind. + type: u32 + - + name: id + doc: id of the dmabuf binding + type: u32 + checks: + min: 1 operations: list: @@ -619,6 +652,20 @@ operations: - rx-bytes - tx-packets - tx-bytes + - + name: bind-rx + doc: Bind dmabuf to netdev + attribute-set: dmabuf + flags: [ admin-perm ] + do: + request: + attributes: + - ifindex + - fd + - queues + reply: + attributes: + - id mcast-groups: list: diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h index 43742ac5b00d..91bf3ecc5f1d 100644 --- a/include/uapi/linux/netdev.h +++ b/include/uapi/linux/netdev.h @@ -173,6 +173,16 @@ enum { NETDEV_A_QSTATS_MAX = (__NETDEV_A_QSTATS_MAX - 1) }; +enum { + NETDEV_A_DMABUF_IFINDEX = 1, + NETDEV_A_DMABUF_QUEUES, + NETDEV_A_DMABUF_FD, + NETDEV_A_DMABUF_ID, + + __NETDEV_A_DMABUF_MAX, + NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1) +}; + enum { NETDEV_CMD_DEV_GET = 1, NETDEV_CMD_DEV_ADD_NTF, @@ -186,6 +196,7 @@ enum { NETDEV_CMD_QUEUE_GET, NETDEV_CMD_NAPI_GET, NETDEV_CMD_QSTATS_GET, + NETDEV_CMD_BIND_RX, __NETDEV_CMD_MAX, NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1) diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c index 8350a0afa9ec..6b7fe6035067 100644 --- a/net/core/netdev-genl-gen.c +++ b/net/core/netdev-genl-gen.c @@ -27,6 +27,11 @@ const struct nla_policy netdev_page_pool_info_nl_policy[NETDEV_A_PAGE_POOL_IFIND [NETDEV_A_PAGE_POOL_IFINDEX] = NLA_POLICY_FULL_RANGE(NLA_U32, &netdev_a_page_pool_ifindex_range), }; +const struct nla_policy netdev_queue_id_nl_policy[NETDEV_A_QUEUE_TYPE + 1] = { + [NETDEV_A_QUEUE_ID] = { .type = NLA_U32, }, + [NETDEV_A_QUEUE_TYPE] = NLA_POLICY_MAX(NLA_U32, 1), +}; + /* NETDEV_CMD_DEV_GET - do */ static const struct nla_policy netdev_dev_get_nl_policy[NETDEV_A_DEV_IFINDEX + 1] = { [NETDEV_A_DEV_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1), @@ -74,6 +79,13 @@ static const struct nla_policy netdev_qstats_get_nl_policy[NETDEV_A_QSTATS_SCOPE [NETDEV_A_QSTATS_SCOPE] = NLA_POLICY_MASK(NLA_UINT, 0x1), }; +/* NETDEV_CMD_BIND_RX - do */ +static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_DMABUF_FD + 1] = { + [NETDEV_A_DMABUF_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1), + [NETDEV_A_DMABUF_FD] = { .type = NLA_U32, }, + [NETDEV_A_DMABUF_QUEUES] = 
NLA_POLICY_NESTED(netdev_queue_id_nl_policy),
+};
+
 /* Ops table for netdev */
 static const struct genl_split_ops netdev_nl_ops[] = {
 	{
@@ -151,6 +163,13 @@ static const struct genl_split_ops netdev_nl_ops[] = {
 		.maxattr = NETDEV_A_QSTATS_SCOPE,
 		.flags = GENL_CMD_CAP_DUMP,
 	},
+	{
+		.cmd = NETDEV_CMD_BIND_RX,
+		.doit = netdev_nl_bind_rx_doit,
+		.policy = netdev_bind_rx_nl_policy,
+		.maxattr = NETDEV_A_DMABUF_FD,
+		.flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+	},
 };
 
 static const struct genl_multicast_group netdev_nl_mcgrps[] = {
diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h
index 4db40fd5b4a9..67c34005750c 100644
--- a/net/core/netdev-genl-gen.h
+++ b/net/core/netdev-genl-gen.h
@@ -13,6 +13,7 @@
 
 /* Common nested types */
 extern const struct nla_policy netdev_page_pool_info_nl_policy[NETDEV_A_PAGE_POOL_IFINDEX + 1];
+extern const struct nla_policy netdev_queue_id_nl_policy[NETDEV_A_QUEUE_TYPE + 1];
 
 int netdev_nl_dev_get_doit(struct sk_buff *skb, struct genl_info *info);
 int netdev_nl_dev_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb);
@@ -30,6 +31,7 @@ int netdev_nl_napi_get_doit(struct sk_buff *skb, struct genl_info *info);
 int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb);
 int netdev_nl_qstats_get_dumpit(struct sk_buff *skb,
 				struct netlink_callback *cb);
+int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info);
 
 enum {
 	NETDEV_NLGRP_MGMT,
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index a17d7eaeb001..699c34b9b03c 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -723,6 +723,12 @@ int netdev_nl_qstats_get_dumpit(struct sk_buff *skb,
 	return err;
 }
 
+/* Stub */
+int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
+{
+	return 0;
+}
+
 static int netdev_genl_netdevice_event(struct notifier_block *nb,
 				       unsigned long event, void *ptr)
 {
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 43742ac5b00d..91bf3ecc5f1d 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -173,6 +173,16 @@ enum {
 	NETDEV_A_QSTATS_MAX = (__NETDEV_A_QSTATS_MAX - 1)
 };
 
+enum {
+	NETDEV_A_DMABUF_IFINDEX = 1,
+	NETDEV_A_DMABUF_QUEUES,
+	NETDEV_A_DMABUF_FD,
+	NETDEV_A_DMABUF_ID,
+
+	__NETDEV_A_DMABUF_MAX,
+	NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1)
+};
+
 enum {
 	NETDEV_CMD_DEV_GET = 1,
 	NETDEV_CMD_DEV_ADD_NTF,
@@ -186,6 +196,7 @@ enum {
 	NETDEV_CMD_QUEUE_GET,
 	NETDEV_CMD_NAPI_GET,
 	NETDEV_CMD_QSTATS_GET,
+	NETDEV_CMD_BIND_RX,
 
 	__NETDEV_CMD_MAX,
 	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
--
cgit v1.2.3

From 8f0b3cc9a4c102c24808c87f1bc943659d7a7f9f Mon Sep 17 00:00:00 2001
From: Mina Almasry
Date: Tue, 10 Sep 2024 17:14:53 +0000
Subject: tcp: RX path for devmem TCP

In tcp_recvmsg_locked(), detect if the skb being received by the user
is a devmem skb. In this case - if the user provided the
MSG_SOCK_DEVMEM flag - pass it to tcp_recvmsg_dmabuf() for custom
handling.

tcp_recvmsg_dmabuf() copies any data in the skb header to the linear
buffer, and returns a cmsg to the user indicating the number of bytes
returned in the linear buffer.

tcp_recvmsg_dmabuf() then loops over the inaccessible devmem skb frags,
and returns to the user a cmsg_devmem indicating the location of the
data in the dmabuf device memory. cmsg_devmem contains this
information:

1. the offset into the dmabuf where the payload starts. 'frag_offset'.
2. the size of the frag. 'frag_size'.
3.
an opaque token 'frag_token' to return to the kernel when the buffer is to be released. The pages awaiting freeing are stored in the newly added sk->sk_user_frags, and each page passed to userspace is get_page()'d. This reference is dropped once the userspace indicates that it is done reading this page. All pages are released when the socket is destroyed. Signed-off-by: Willem de Bruijn Signed-off-by: Kaiyuan Zhang Signed-off-by: Mina Almasry Reviewed-by: Pavel Begunkov Reviewed-by: Eric Dumazet Link: https://patch.msgid.link/20240910171458.219195-10-almasrymina@google.com Signed-off-by: Jakub Kicinski --- arch/alpha/include/uapi/asm/socket.h | 5 + arch/mips/include/uapi/asm/socket.h | 5 + arch/parisc/include/uapi/asm/socket.h | 5 + arch/sparc/include/uapi/asm/socket.h | 5 + include/linux/socket.h | 1 + include/net/sock.h | 2 + include/uapi/asm-generic/socket.h | 5 + include/uapi/linux/uio.h | 13 ++ net/core/devmem.h | 22 +++ net/ipv4/tcp.c | 257 +++++++++++++++++++++++++++++++++- net/ipv4/tcp_ipv4.c | 16 +++ net/ipv4/tcp_minisocks.c | 2 + 12 files changed, 333 insertions(+), 5 deletions(-) (limited to 'include/uapi') diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h index e94f621903fe..ef4656a41058 100644 --- a/arch/alpha/include/uapi/asm/socket.h +++ b/arch/alpha/include/uapi/asm/socket.h @@ -140,6 +140,11 @@ #define SO_PASSPIDFD 76 #define SO_PEERPIDFD 77 +#define SO_DEVMEM_LINEAR 78 +#define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR +#define SO_DEVMEM_DMABUF 79 +#define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF + #if !defined(__KERNEL__) #if __BITS_PER_LONG == 64 diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h index 60ebaed28a4c..414807d55e33 100644 --- a/arch/mips/include/uapi/asm/socket.h +++ b/arch/mips/include/uapi/asm/socket.h @@ -151,6 +151,11 @@ #define SO_PASSPIDFD 76 #define SO_PEERPIDFD 77 +#define SO_DEVMEM_LINEAR 78 +#define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR +#define SO_DEVMEM_DMABUF 79 +#define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF + #if !defined(__KERNEL__) #if __BITS_PER_LONG == 64 diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h index be264c2b1a11..2b817efd4544 100644 --- a/arch/parisc/include/uapi/asm/socket.h +++ b/arch/parisc/include/uapi/asm/socket.h @@ -132,6 +132,11 @@ #define SO_PASSPIDFD 0x404A #define SO_PEERPIDFD 0x404B +#define SO_DEVMEM_LINEAR 78 +#define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR +#define SO_DEVMEM_DMABUF 79 +#define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF + #if !defined(__KERNEL__) #if __BITS_PER_LONG == 64 diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h index 682da3714686..00248fc68977 100644 --- a/arch/sparc/include/uapi/asm/socket.h +++ b/arch/sparc/include/uapi/asm/socket.h @@ -133,6 +133,11 @@ #define SO_PASSPIDFD 0x0055 #define SO_PEERPIDFD 0x0056 +#define SO_DEVMEM_LINEAR 0x0057 +#define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR +#define SO_DEVMEM_DMABUF 0x0058 +#define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF + #if !defined(__KERNEL__) diff --git a/include/linux/socket.h b/include/linux/socket.h index df9cdb8bbfb8..d18cc47e89bd 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -327,6 +327,7 @@ struct ucred { * plain text and require encryption */ +#define MSG_SOCK_DEVMEM 0x2000000 /* Receive devmem skbs as cmsg */ #define MSG_ZEROCOPY 0x4000000 /* Use user data in kernel path */ #define MSG_SPLICE_PAGES 0x8000000 /* Splice the pages from the iterator in sendmsg() */ #define MSG_FASTOPEN 0x20000000 
/* Send data in TCP SYN */ diff --git a/include/net/sock.h b/include/net/sock.h index f51d61fab059..c58ca8dd561b 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -337,6 +337,7 @@ struct sk_filter; * @sk_txtime_report_errors: set report errors mode for SO_TXTIME * @sk_txtime_unused: unused txtime flags * @ns_tracker: tracker for netns reference + * @sk_user_frags: xarray of pages the user is holding a reference on. */ struct sock { /* @@ -542,6 +543,7 @@ struct sock { #endif struct rcu_head sk_rcu; netns_tracker ns_tracker; + struct xarray sk_user_frags; }; struct sock_bh_locked { diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h index 8ce8a39a1e5f..e993edc9c0ee 100644 --- a/include/uapi/asm-generic/socket.h +++ b/include/uapi/asm-generic/socket.h @@ -135,6 +135,11 @@ #define SO_PASSPIDFD 76 #define SO_PEERPIDFD 77 +#define SO_DEVMEM_LINEAR 78 +#define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR +#define SO_DEVMEM_DMABUF 79 +#define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF + #if !defined(__KERNEL__) #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__)) diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h index 059b1a9147f4..3a22ddae376a 100644 --- a/include/uapi/linux/uio.h +++ b/include/uapi/linux/uio.h @@ -20,6 +20,19 @@ struct iovec __kernel_size_t iov_len; /* Must be size_t (1003.1g) */ }; +struct dmabuf_cmsg { + __u64 frag_offset; /* offset into the dmabuf where the frag starts. + */ + __u32 frag_size; /* size of the frag. */ + __u32 frag_token; /* token representing this frag for + * DEVMEM_DONTNEED. + */ + __u32 dmabuf_id; /* dmabuf id this frag belongs to. */ + __u32 flags; /* Currently unused. Reserved for future + * uses. + */ +}; + /* * UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1) */ diff --git a/net/core/devmem.h b/net/core/devmem.h index b1db4877cff9..76099ef9c482 100644 --- a/net/core/devmem.h +++ b/net/core/devmem.h @@ -91,6 +91,19 @@ net_iov_binding(const struct net_iov *niov) return net_iov_owner(niov)->binding; } +static inline unsigned long net_iov_virtual_addr(const struct net_iov *niov) +{ + struct dmabuf_genpool_chunk_owner *owner = net_iov_owner(niov); + + return owner->base_virtual + + ((unsigned long)net_iov_idx(niov) << PAGE_SHIFT); +} + +static inline u32 net_iov_binding_id(const struct net_iov *niov) +{ + return net_iov_owner(niov)->binding->id; +} + static inline void net_devmem_dmabuf_binding_get(struct net_devmem_dmabuf_binding *binding) { @@ -153,6 +166,15 @@ static inline void net_devmem_free_dmabuf(struct net_iov *ppiov) { } +static inline unsigned long net_iov_virtual_addr(const struct net_iov *niov) +{ + return 0; +} + +static inline u32 net_iov_binding_id(const struct net_iov *niov) +{ + return 0; +} #endif #endif /* _NET_DEVMEM_H */ diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index a2fac029a84a..4f77bd862e95 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -285,6 +285,8 @@ #include #include +#include "../core/devmem.h" + /* Track pending CMSGs. */ enum { TCP_CMSG_INQ = 1, @@ -471,6 +473,7 @@ void tcp_init_sock(struct sock *sk) set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags); sk_sockets_allocated_inc(sk); + xa_init_flags(&sk->sk_user_frags, XA_FLAGS_ALLOC1); } EXPORT_SYMBOL(tcp_init_sock); @@ -2328,6 +2331,220 @@ static int tcp_inq_hint(struct sock *sk) return inq; } +/* batch __xa_alloc() calls and reduce xa_lock()/xa_unlock() overhead. 
*/ +struct tcp_xa_pool { + u8 max; /* max <= MAX_SKB_FRAGS */ + u8 idx; /* idx <= max */ + __u32 tokens[MAX_SKB_FRAGS]; + netmem_ref netmems[MAX_SKB_FRAGS]; +}; + +static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p) +{ + int i; + + /* Commit part that has been copied to user space. */ + for (i = 0; i < p->idx; i++) + __xa_cmpxchg(&sk->sk_user_frags, p->tokens[i], XA_ZERO_ENTRY, + (__force void *)p->netmems[i], GFP_KERNEL); + /* Rollback what has been pre-allocated and is no longer needed. */ + for (; i < p->max; i++) + __xa_erase(&sk->sk_user_frags, p->tokens[i]); + + p->max = 0; + p->idx = 0; +} + +static void tcp_xa_pool_commit(struct sock *sk, struct tcp_xa_pool *p) +{ + if (!p->max) + return; + + xa_lock_bh(&sk->sk_user_frags); + + tcp_xa_pool_commit_locked(sk, p); + + xa_unlock_bh(&sk->sk_user_frags); +} + +static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p, + unsigned int max_frags) +{ + int err, k; + + if (p->idx < p->max) + return 0; + + xa_lock_bh(&sk->sk_user_frags); + + tcp_xa_pool_commit_locked(sk, p); + + for (k = 0; k < max_frags; k++) { + err = __xa_alloc(&sk->sk_user_frags, &p->tokens[k], + XA_ZERO_ENTRY, xa_limit_31b, GFP_KERNEL); + if (err) + break; + } + + xa_unlock_bh(&sk->sk_user_frags); + + p->max = k; + p->idx = 0; + return k ? 0 : err; +} + +/* On error, returns the -errno. On success, returns number of bytes sent to the + * user. May not consume all of @remaining_len. + */ +static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb, + unsigned int offset, struct msghdr *msg, + int remaining_len) +{ + struct dmabuf_cmsg dmabuf_cmsg = { 0 }; + struct tcp_xa_pool tcp_xa_pool; + unsigned int start; + int i, copy, n; + int sent = 0; + int err = 0; + + tcp_xa_pool.max = 0; + tcp_xa_pool.idx = 0; + do { + start = skb_headlen(skb); + + if (skb_frags_readable(skb)) { + err = -ENODEV; + goto out; + } + + /* Copy header. */ + copy = start - offset; + if (copy > 0) { + copy = min(copy, remaining_len); + + n = copy_to_iter(skb->data + offset, copy, + &msg->msg_iter); + if (n != copy) { + err = -EFAULT; + goto out; + } + + offset += copy; + remaining_len -= copy; + + /* First a dmabuf_cmsg for # bytes copied to user + * buffer. + */ + memset(&dmabuf_cmsg, 0, sizeof(dmabuf_cmsg)); + dmabuf_cmsg.frag_size = copy; + err = put_cmsg(msg, SOL_SOCKET, SO_DEVMEM_LINEAR, + sizeof(dmabuf_cmsg), &dmabuf_cmsg); + if (err || msg->msg_flags & MSG_CTRUNC) { + msg->msg_flags &= ~MSG_CTRUNC; + if (!err) + err = -ETOOSMALL; + goto out; + } + + sent += copy; + + if (remaining_len == 0) + goto out; + } + + /* after that, send information of dmabuf pages through a + * sequence of cmsg + */ + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + struct net_iov *niov; + u64 frag_offset; + int end; + + /* !skb_frags_readable() should indicate that ALL the + * frags in this skb are dmabuf net_iovs. We're checking + * for that flag above, but also check individual frags + * here. If the tcp stack is not setting + * skb_frags_readable() correctly, we still don't want + * to crash here. 
+ */ + if (!skb_frag_net_iov(frag)) { + net_err_ratelimited("Found non-dmabuf skb with net_iov"); + err = -ENODEV; + goto out; + } + + niov = skb_frag_net_iov(frag); + end = start + skb_frag_size(frag); + copy = end - offset; + + if (copy > 0) { + copy = min(copy, remaining_len); + + frag_offset = net_iov_virtual_addr(niov) + + skb_frag_off(frag) + offset - + start; + dmabuf_cmsg.frag_offset = frag_offset; + dmabuf_cmsg.frag_size = copy; + err = tcp_xa_pool_refill(sk, &tcp_xa_pool, + skb_shinfo(skb)->nr_frags - i); + if (err) + goto out; + + /* Will perform the exchange later */ + dmabuf_cmsg.frag_token = tcp_xa_pool.tokens[tcp_xa_pool.idx]; + dmabuf_cmsg.dmabuf_id = net_iov_binding_id(niov); + + offset += copy; + remaining_len -= copy; + + err = put_cmsg(msg, SOL_SOCKET, + SO_DEVMEM_DMABUF, + sizeof(dmabuf_cmsg), + &dmabuf_cmsg); + if (err || msg->msg_flags & MSG_CTRUNC) { + msg->msg_flags &= ~MSG_CTRUNC; + if (!err) + err = -ETOOSMALL; + goto out; + } + + atomic_long_inc(&niov->pp_ref_count); + tcp_xa_pool.netmems[tcp_xa_pool.idx++] = skb_frag_netmem(frag); + + sent += copy; + + if (remaining_len == 0) + goto out; + } + start = end; + } + + tcp_xa_pool_commit(sk, &tcp_xa_pool); + if (!remaining_len) + goto out; + + /* if remaining_len is not satisfied yet, we need to go to the + * next frag in the frag_list to satisfy remaining_len. + */ + skb = skb_shinfo(skb)->frag_list ?: skb->next; + + offset = offset - start; + } while (skb); + + if (remaining_len) { + err = -EFAULT; + goto out; + } + +out: + tcp_xa_pool_commit(sk, &tcp_xa_pool); + if (!sent) + sent = err; + + return sent; +} + /* * This routine copies from a sock struct into the user buffer. * @@ -2341,6 +2558,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len, int *cmsg_flags) { struct tcp_sock *tp = tcp_sk(sk); + int last_copied_dmabuf = -1; /* uninitialized */ int copied = 0; u32 peek_seq; u32 *seq; @@ -2520,15 +2738,44 @@ found_ok_skb: } if (!(flags & MSG_TRUNC)) { - err = skb_copy_datagram_msg(skb, offset, msg, used); - if (err) { - /* Exception. Bailout! */ - if (!copied) - copied = -EFAULT; + if (last_copied_dmabuf != -1 && + last_copied_dmabuf != !skb_frags_readable(skb)) break; + + if (skb_frags_readable(skb)) { + err = skb_copy_datagram_msg(skb, offset, msg, + used); + if (err) { + /* Exception. Bailout! */ + if (!copied) + copied = -EFAULT; + break; + } + } else { + if (!(flags & MSG_SOCK_DEVMEM)) { + /* dmabuf skbs can only be received + * with the MSG_SOCK_DEVMEM flag. 
+ */ + if (!copied) + copied = -EFAULT; + + break; + } + + err = tcp_recvmsg_dmabuf(sk, skb, offset, msg, + used); + if (err <= 0) { + if (!copied) + copied = -EFAULT; + + break; + } + used = err; } } + last_copied_dmabuf = !skb_frags_readable(skb); + WRITE_ONCE(*seq, *seq + used); copied += used; len -= used; diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index eb631e66ee03..5afe5e57c89b 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -79,6 +79,7 @@ #include #include #include +#include #include #include @@ -2512,10 +2513,25 @@ static void tcp_md5sig_info_free_rcu(struct rcu_head *head) } #endif +static void tcp_release_user_frags(struct sock *sk) +{ +#ifdef CONFIG_PAGE_POOL + unsigned long index; + void *netmem; + + xa_for_each(&sk->sk_user_frags, index, netmem) + WARN_ON_ONCE(!napi_pp_put_page((__force netmem_ref)netmem)); +#endif +} + void tcp_v4_destroy_sock(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk); + tcp_release_user_frags(sk); + + xa_destroy(&sk->sk_user_frags); + trace_tcp_destroy_sock(sk); tcp_clear_xmit_timers(sk); diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index ad562272db2e..bb1fe1ba867a 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -628,6 +628,8 @@ struct sock *tcp_create_openreq_child(const struct sock *sk, __TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS); + xa_init_flags(&newsk->sk_user_frags, XA_FLAGS_ALLOC1); + return newsk; } EXPORT_SYMBOL(tcp_create_openreq_child); -- cgit v1.2.3 From 678f6e28b5f6fc2316f2c0fed8f8903101f1e128 Mon Sep 17 00:00:00 2001 From: Mina Almasry Date: Tue, 10 Sep 2024 17:14:54 +0000 Subject: net: add SO_DEVMEM_DONTNEED setsockopt to release RX frags Add an interface for the user to notify the kernel that it is done reading the devmem dmabuf frags returned as cmsg. The kernel will drop the reference on the frags to make them available for reuse. 
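
As a rough sketch (not from this patch; the helper name is made up), a
receiver that has finished with a frag described by a SCM_DEVMEM_DMABUF
cmsg could return its token like this, using struct dmabuf_token from
<linux/uio.h> as added earlier in this series:

    /* Sketch only: release one devmem frag token back to the kernel.
     * Per this patch, setsockopt() returns the number of frags freed,
     * or -1 with errno set on error.
     */
    static int devmem_release_frag(int fd, const struct dmabuf_cmsg *dc)
    {
            struct dmabuf_token tok = {
                    .token_start = dc->frag_token,
                    .token_count = 1,
            };

            return setsockopt(fd, SOL_SOCKET, SO_DEVMEM_DONTNEED,
                              &tok, sizeof(tok));
    }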
Signed-off-by: Willem de Bruijn Signed-off-by: Kaiyuan Zhang Signed-off-by: Mina Almasry Reviewed-by: Pavel Begunkov Reviewed-by: Eric Dumazet Link: https://patch.msgid.link/20240910171458.219195-11-almasrymina@google.com Signed-off-by: Jakub Kicinski --- arch/alpha/include/uapi/asm/socket.h | 1 + arch/mips/include/uapi/asm/socket.h | 1 + arch/parisc/include/uapi/asm/socket.h | 1 + arch/sparc/include/uapi/asm/socket.h | 1 + include/uapi/asm-generic/socket.h | 1 + include/uapi/linux/uio.h | 5 +++ net/core/sock.c | 68 +++++++++++++++++++++++++++++++++++ 7 files changed, 78 insertions(+) (limited to 'include/uapi') diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h index ef4656a41058..251b73c5481e 100644 --- a/arch/alpha/include/uapi/asm/socket.h +++ b/arch/alpha/include/uapi/asm/socket.h @@ -144,6 +144,7 @@ #define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR #define SO_DEVMEM_DMABUF 79 #define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF +#define SO_DEVMEM_DONTNEED 80 #if !defined(__KERNEL__) diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h index 414807d55e33..8ab7582291ab 100644 --- a/arch/mips/include/uapi/asm/socket.h +++ b/arch/mips/include/uapi/asm/socket.h @@ -155,6 +155,7 @@ #define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR #define SO_DEVMEM_DMABUF 79 #define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF +#define SO_DEVMEM_DONTNEED 80 #if !defined(__KERNEL__) diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h index 2b817efd4544..38fc0b188e08 100644 --- a/arch/parisc/include/uapi/asm/socket.h +++ b/arch/parisc/include/uapi/asm/socket.h @@ -136,6 +136,7 @@ #define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR #define SO_DEVMEM_DMABUF 79 #define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF +#define SO_DEVMEM_DONTNEED 80 #if !defined(__KERNEL__) diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h index 00248fc68977..57084ed2f3c4 100644 --- a/arch/sparc/include/uapi/asm/socket.h +++ b/arch/sparc/include/uapi/asm/socket.h @@ -137,6 +137,7 @@ #define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR #define SO_DEVMEM_DMABUF 0x0058 #define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF +#define SO_DEVMEM_DONTNEED 0x0059 #if !defined(__KERNEL__) diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h index e993edc9c0ee..3b4e3e815602 100644 --- a/include/uapi/asm-generic/socket.h +++ b/include/uapi/asm-generic/socket.h @@ -139,6 +139,7 @@ #define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR #define SO_DEVMEM_DMABUF 79 #define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF +#define SO_DEVMEM_DONTNEED 80 #if !defined(__KERNEL__) diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h index 3a22ddae376a..649739e0c404 100644 --- a/include/uapi/linux/uio.h +++ b/include/uapi/linux/uio.h @@ -33,6 +33,11 @@ struct dmabuf_cmsg { */ }; +struct dmabuf_token { + __u32 token_start; + __u32 token_count; +}; + /* * UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1) */ diff --git a/net/core/sock.c b/net/core/sock.c index 468b1239606c..bbb57b5af0b1 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -124,6 +124,7 @@ #include #include #include +#include #include #include #include @@ -1049,6 +1050,69 @@ static int sock_reserve_memory(struct sock *sk, int bytes) return 0; } +#ifdef CONFIG_PAGE_POOL + +/* This is the number of tokens that the user can SO_DEVMEM_DONTNEED in + * 1 syscall. The limit exists to limit the amount of memory the kernel + * allocates to copy these tokens. 
+ */ +#define MAX_DONTNEED_TOKENS 128 + +static noinline_for_stack int +sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen) +{ + unsigned int num_tokens, i, j, k, netmem_num = 0; + struct dmabuf_token *tokens; + netmem_ref netmems[16]; + int ret = 0; + + if (!sk_is_tcp(sk)) + return -EBADF; + + if (optlen % sizeof(struct dmabuf_token) || + optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS) + return -EINVAL; + + tokens = kvmalloc_array(optlen, sizeof(*tokens), GFP_KERNEL); + if (!tokens) + return -ENOMEM; + + num_tokens = optlen / sizeof(struct dmabuf_token); + if (copy_from_sockptr(tokens, optval, optlen)) { + kvfree(tokens); + return -EFAULT; + } + + xa_lock_bh(&sk->sk_user_frags); + for (i = 0; i < num_tokens; i++) { + for (j = 0; j < tokens[i].token_count; j++) { + netmem_ref netmem = (__force netmem_ref)__xa_erase( + &sk->sk_user_frags, tokens[i].token_start + j); + + if (netmem && + !WARN_ON_ONCE(!netmem_is_net_iov(netmem))) { + netmems[netmem_num++] = netmem; + if (netmem_num == ARRAY_SIZE(netmems)) { + xa_unlock_bh(&sk->sk_user_frags); + for (k = 0; k < netmem_num; k++) + WARN_ON_ONCE(!napi_pp_put_page(netmems[k])); + netmem_num = 0; + xa_lock_bh(&sk->sk_user_frags); + } + ret++; + } + } + } + + xa_unlock_bh(&sk->sk_user_frags); + for (k = 0; k < netmem_num; k++) + WARN_ON_ONCE(!napi_pp_put_page(netmems[k])); + + kvfree(tokens); + return ret; +} +#endif + void sockopt_lock_sock(struct sock *sk) { /* When current->bpf_ctx is set, the setsockopt is called from @@ -1211,6 +1275,10 @@ int sk_setsockopt(struct sock *sk, int level, int optname, ret = -EOPNOTSUPP; return ret; } +#ifdef CONFIG_PAGE_POOL + case SO_DEVMEM_DONTNEED: + return sock_devmem_dontneed(sk, optval, optlen); +#endif } sockopt_lock_sock(sk); -- cgit v1.2.3 From d0caf9876a1c9f844307effb598ad1312d9e0025 Mon Sep 17 00:00:00 2001 From: Mina Almasry Date: Tue, 10 Sep 2024 17:14:57 +0000 Subject: netdev: add dmabuf introspection Add dmabuf information to page_pool stats: $ ./cli.py --spec ../netlink/specs/netdev.yaml --dump page-pool-get ... {'dmabuf': 10, 'id': 456, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 455, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 454, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 453, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 452, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 451, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 450, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 449, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, And queue stats: $ ./cli.py --spec ../netlink/specs/netdev.yaml --dump queue-get ... 
{'dmabuf': 10, 'id': 8, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 9, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 10, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 11, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 12, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 13, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 14, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 15, 'ifindex': 3, 'type': 'rx'}, Suggested-by: Jakub Kicinski Signed-off-by: Mina Almasry Reviewed-by: Jakub Kicinski Link: https://patch.msgid.link/20240910171458.219195-14-almasrymina@google.com Signed-off-by: Jakub Kicinski --- Documentation/netlink/specs/netdev.yaml | 10 ++++++++++ include/uapi/linux/netdev.h | 2 ++ net/core/netdev-genl.c | 7 +++++++ net/core/page_pool_user.c | 5 +++++ tools/include/uapi/linux/netdev.h | 2 ++ 5 files changed, 26 insertions(+) (limited to 'include/uapi') diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml index 0c747530c275..08412c279297 100644 --- a/Documentation/netlink/specs/netdev.yaml +++ b/Documentation/netlink/specs/netdev.yaml @@ -167,6 +167,10 @@ attribute-sets: "re-attached", they are just waiting to disappear. Attribute is absent if Page Pool has not been detached, and can still be used to allocate new memory. + - + name: dmabuf + doc: ID of the dmabuf this page-pool is attached to. + type: u32 - name: page-pool-info subset-of: page-pool @@ -268,6 +272,10 @@ attribute-sets: name: napi-id doc: ID of the NAPI instance which services this queue. type: u32 + - + name: dmabuf + doc: ID of the dmabuf attached to this queue, if any. + type: u32 - name: qstats @@ -543,6 +551,7 @@ operations: - inflight - inflight-mem - detach-time + - dmabuf dump: reply: *pp-reply config-cond: page-pool @@ -607,6 +616,7 @@ operations: - type - napi-id - ifindex + - dmabuf dump: request: attributes: diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h index 91bf3ecc5f1d..7c308f04e7a0 100644 --- a/include/uapi/linux/netdev.h +++ b/include/uapi/linux/netdev.h @@ -93,6 +93,7 @@ enum { NETDEV_A_PAGE_POOL_INFLIGHT, NETDEV_A_PAGE_POOL_INFLIGHT_MEM, NETDEV_A_PAGE_POOL_DETACH_TIME, + NETDEV_A_PAGE_POOL_DMABUF, __NETDEV_A_PAGE_POOL_MAX, NETDEV_A_PAGE_POOL_MAX = (__NETDEV_A_PAGE_POOL_MAX - 1) @@ -131,6 +132,7 @@ enum { NETDEV_A_QUEUE_IFINDEX, NETDEV_A_QUEUE_TYPE, NETDEV_A_QUEUE_NAPI_ID, + NETDEV_A_QUEUE_DMABUF, __NETDEV_A_QUEUE_MAX, NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1) diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c index 9153a8ab0cf8..1cb954f2d39e 100644 --- a/net/core/netdev-genl.c +++ b/net/core/netdev-genl.c @@ -295,6 +295,7 @@ static int netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev, u32 q_idx, u32 q_type, const struct genl_info *info) { + struct net_devmem_dmabuf_binding *binding; struct netdev_rx_queue *rxq; struct netdev_queue *txq; void *hdr; @@ -314,6 +315,12 @@ netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev, if (rxq->napi && nla_put_u32(rsp, NETDEV_A_QUEUE_NAPI_ID, rxq->napi->napi_id)) goto nla_put_failure; + + binding = rxq->mp_params.mp_priv; + if (binding && + nla_put_u32(rsp, NETDEV_A_QUEUE_DMABUF, binding->id)) + goto nla_put_failure; + break; case NETDEV_QUEUE_TYPE_TX: txq = netdev_get_tx_queue(netdev, q_idx); diff --git a/net/core/page_pool_user.c b/net/core/page_pool_user.c index cd6267ba6fa3..48335766c1bf 100644 --- a/net/core/page_pool_user.c +++ b/net/core/page_pool_user.c @@ -9,6 +9,7 @@ #include #include +#include "devmem.h" 
#include "page_pool_priv.h" #include "netdev-genl-gen.h" @@ -213,6 +214,7 @@ static int page_pool_nl_fill(struct sk_buff *rsp, const struct page_pool *pool, const struct genl_info *info) { + struct net_devmem_dmabuf_binding *binding = pool->mp_priv; size_t inflight, refsz; void *hdr; @@ -242,6 +244,9 @@ page_pool_nl_fill(struct sk_buff *rsp, const struct page_pool *pool, pool->user.detach_time)) goto err_cancel; + if (binding && nla_put_u32(rsp, NETDEV_A_PAGE_POOL_DMABUF, binding->id)) + goto err_cancel; + genlmsg_end(rsp, hdr); return 0; diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h index 91bf3ecc5f1d..7c308f04e7a0 100644 --- a/tools/include/uapi/linux/netdev.h +++ b/tools/include/uapi/linux/netdev.h @@ -93,6 +93,7 @@ enum { NETDEV_A_PAGE_POOL_INFLIGHT, NETDEV_A_PAGE_POOL_INFLIGHT_MEM, NETDEV_A_PAGE_POOL_DETACH_TIME, + NETDEV_A_PAGE_POOL_DMABUF, __NETDEV_A_PAGE_POOL_MAX, NETDEV_A_PAGE_POOL_MAX = (__NETDEV_A_PAGE_POOL_MAX - 1) @@ -131,6 +132,7 @@ enum { NETDEV_A_QUEUE_IFINDEX, NETDEV_A_QUEUE_TYPE, NETDEV_A_QUEUE_NAPI_ID, + NETDEV_A_QUEUE_DMABUF, __NETDEV_A_QUEUE_MAX, NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1) -- cgit v1.2.3 From 8f9bf857e43b3f75a098e3af3a6fec2d03203a1e Mon Sep 17 00:00:00 2001 From: Parthiban Veerasooran Date: Mon, 9 Sep 2024 13:55:06 +0530 Subject: net: ethernet: oa_tc6: implement internal PHY initialization Internal PHY is initialized as per the PHY register capability supported by the MAC-PHY. Direct PHY Register Access Capability indicates if PHY registers are directly accessible within the SPI register memory space. Indirect PHY Register Access Capability indicates if PHY registers are indirectly accessible through the MDIO/MDC registers MDIOACCn defined in OPEN Alliance specification. Currently the direct register access is only supported. Reviewed-by: Andrew Lunn Signed-off-by: Parthiban Veerasooran Link: https://patch.msgid.link/20240909082514.262942-7-Parthiban.Veerasooran@microchip.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/oa_tc6.c | 230 +++++++++++++++++++++++++++++++++++++++++- include/linux/oa_tc6.h | 4 +- include/uapi/linux/mdio.h | 1 + 3 files changed, 233 insertions(+), 2 deletions(-) (limited to 'include/uapi') diff --git a/drivers/net/ethernet/oa_tc6.c b/drivers/net/ethernet/oa_tc6.c index 86b032cdbee1..fc276d881dc9 100644 --- a/drivers/net/ethernet/oa_tc6.c +++ b/drivers/net/ethernet/oa_tc6.c @@ -7,9 +7,15 @@ #include #include +#include +#include #include /* OPEN Alliance TC6 registers */ +/* Standard Capabilities Register */ +#define OA_TC6_REG_STDCAP 0x0002 +#define STDCAP_DIRECT_PHY_REG_ACCESS BIT(8) + /* Reset Control and Status Register */ #define OA_TC6_REG_RESET 0x0003 #define RESET_SWRESET BIT(0) /* Software Reset */ @@ -25,6 +31,10 @@ #define INT_MASK0_RX_BUFFER_OVERFLOW_ERR_MASK BIT(3) #define INT_MASK0_TX_PROTOCOL_ERR_MASK BIT(0) +/* PHY Clause 22 registers base address and mask */ +#define OA_TC6_PHY_STD_REG_ADDR_BASE 0xFF00 +#define OA_TC6_PHY_STD_REG_ADDR_MASK 0x1F + /* Control command header */ #define OA_TC6_CTRL_HEADER_DATA_NOT_CTRL BIT(31) #define OA_TC6_CTRL_HEADER_WRITE_NOT_READ BIT(29) @@ -33,6 +43,15 @@ #define OA_TC6_CTRL_HEADER_LENGTH GENMASK(7, 1) #define OA_TC6_CTRL_HEADER_PARITY BIT(0) +/* PHY – Clause 45 registers memory map selector (MMS) as per table 6 in the + * OPEN Alliance specification. 
+ */ +#define OA_TC6_PHY_C45_PCS_MMS2 2 /* MMD 3 */ +#define OA_TC6_PHY_C45_PMA_PMD_MMS3 3 /* MMD 1 */ +#define OA_TC6_PHY_C45_VS_PLCA_MMS4 4 /* MMD 31 */ +#define OA_TC6_PHY_C45_AUTO_NEG_MMS5 5 /* MMD 7 */ +#define OA_TC6_PHY_C45_POWER_UNIT_MMS6 6 /* MMD 13 */ + #define OA_TC6_CTRL_HEADER_SIZE 4 #define OA_TC6_CTRL_REG_VALUE_SIZE 4 #define OA_TC6_CTRL_IGNORED_SIZE 4 @@ -46,6 +65,10 @@ /* Internal structure for MAC-PHY drivers */ struct oa_tc6 { + struct device *dev; + struct net_device *netdev; + struct phy_device *phydev; + struct mii_bus *mdiobus; struct spi_device *spi; struct mutex spi_ctrl_lock; /* Protects spi control transfer */ void *spi_ctrl_tx_buf; @@ -298,6 +321,191 @@ int oa_tc6_write_register(struct oa_tc6 *tc6, u32 address, u32 value) } EXPORT_SYMBOL_GPL(oa_tc6_write_register); +static int oa_tc6_check_phy_reg_direct_access_capability(struct oa_tc6 *tc6) +{ + u32 regval; + int ret; + + ret = oa_tc6_read_register(tc6, OA_TC6_REG_STDCAP, ®val); + if (ret) + return ret; + + if (!(regval & STDCAP_DIRECT_PHY_REG_ACCESS)) + return -ENODEV; + + return 0; +} + +static void oa_tc6_handle_link_change(struct net_device *netdev) +{ + phy_print_status(netdev->phydev); +} + +static int oa_tc6_mdiobus_read(struct mii_bus *bus, int addr, int regnum) +{ + struct oa_tc6 *tc6 = bus->priv; + u32 regval; + bool ret; + + ret = oa_tc6_read_register(tc6, OA_TC6_PHY_STD_REG_ADDR_BASE | + (regnum & OA_TC6_PHY_STD_REG_ADDR_MASK), + ®val); + if (ret) + return ret; + + return regval; +} + +static int oa_tc6_mdiobus_write(struct mii_bus *bus, int addr, int regnum, + u16 val) +{ + struct oa_tc6 *tc6 = bus->priv; + + return oa_tc6_write_register(tc6, OA_TC6_PHY_STD_REG_ADDR_BASE | + (regnum & OA_TC6_PHY_STD_REG_ADDR_MASK), + val); +} + +static int oa_tc6_get_phy_c45_mms(int devnum) +{ + switch (devnum) { + case MDIO_MMD_PCS: + return OA_TC6_PHY_C45_PCS_MMS2; + case MDIO_MMD_PMAPMD: + return OA_TC6_PHY_C45_PMA_PMD_MMS3; + case MDIO_MMD_VEND2: + return OA_TC6_PHY_C45_VS_PLCA_MMS4; + case MDIO_MMD_AN: + return OA_TC6_PHY_C45_AUTO_NEG_MMS5; + case MDIO_MMD_POWER_UNIT: + return OA_TC6_PHY_C45_POWER_UNIT_MMS6; + default: + return -EOPNOTSUPP; + } +} + +static int oa_tc6_mdiobus_read_c45(struct mii_bus *bus, int addr, int devnum, + int regnum) +{ + struct oa_tc6 *tc6 = bus->priv; + u32 regval; + int ret; + + ret = oa_tc6_get_phy_c45_mms(devnum); + if (ret < 0) + return ret; + + ret = oa_tc6_read_register(tc6, (ret << 16) | regnum, ®val); + if (ret) + return ret; + + return regval; +} + +static int oa_tc6_mdiobus_write_c45(struct mii_bus *bus, int addr, int devnum, + int regnum, u16 val) +{ + struct oa_tc6 *tc6 = bus->priv; + int ret; + + ret = oa_tc6_get_phy_c45_mms(devnum); + if (ret < 0) + return ret; + + return oa_tc6_write_register(tc6, (ret << 16) | regnum, val); +} + +static int oa_tc6_mdiobus_register(struct oa_tc6 *tc6) +{ + int ret; + + tc6->mdiobus = mdiobus_alloc(); + if (!tc6->mdiobus) { + netdev_err(tc6->netdev, "MDIO bus alloc failed\n"); + return -ENOMEM; + } + + tc6->mdiobus->priv = tc6; + tc6->mdiobus->read = oa_tc6_mdiobus_read; + tc6->mdiobus->write = oa_tc6_mdiobus_write; + /* OPEN Alliance 10BASE-T1x compliance MAC-PHYs will have both C22 and + * C45 registers space. If the PHY is discovered via C22 bus protocol it + * assumes it uses C22 protocol and always uses C22 registers indirect + * access to access C45 registers. This is because, we don't have a + * clean separation between C22/C45 register space and C22/C45 MDIO bus + * protocols. 
Resulting, PHY C45 registers direct access can't be used + * which can save multiple SPI bus access. To support this feature, PHY + * drivers can set .read_mmd/.write_mmd in the PHY driver to call + * .read_c45/.write_c45. Ex: drivers/net/phy/microchip_t1s.c + */ + tc6->mdiobus->read_c45 = oa_tc6_mdiobus_read_c45; + tc6->mdiobus->write_c45 = oa_tc6_mdiobus_write_c45; + tc6->mdiobus->name = "oa-tc6-mdiobus"; + tc6->mdiobus->parent = tc6->dev; + + snprintf(tc6->mdiobus->id, ARRAY_SIZE(tc6->mdiobus->id), "%s", + dev_name(&tc6->spi->dev)); + + ret = mdiobus_register(tc6->mdiobus); + if (ret) { + netdev_err(tc6->netdev, "Could not register MDIO bus\n"); + mdiobus_free(tc6->mdiobus); + return ret; + } + + return 0; +} + +static void oa_tc6_mdiobus_unregister(struct oa_tc6 *tc6) +{ + mdiobus_unregister(tc6->mdiobus); + mdiobus_free(tc6->mdiobus); +} + +static int oa_tc6_phy_init(struct oa_tc6 *tc6) +{ + int ret; + + ret = oa_tc6_check_phy_reg_direct_access_capability(tc6); + if (ret) { + netdev_err(tc6->netdev, + "Direct PHY register access is not supported by the MAC-PHY\n"); + return ret; + } + + ret = oa_tc6_mdiobus_register(tc6); + if (ret) + return ret; + + tc6->phydev = phy_find_first(tc6->mdiobus); + if (!tc6->phydev) { + netdev_err(tc6->netdev, "No PHY found\n"); + oa_tc6_mdiobus_unregister(tc6); + return -ENODEV; + } + + tc6->phydev->is_internal = true; + ret = phy_connect_direct(tc6->netdev, tc6->phydev, + &oa_tc6_handle_link_change, + PHY_INTERFACE_MODE_INTERNAL); + if (ret) { + netdev_err(tc6->netdev, "Can't attach PHY to %s\n", + tc6->mdiobus->id); + oa_tc6_mdiobus_unregister(tc6); + return ret; + } + + phy_attached_info(tc6->netdev->phydev); + + return 0; +} + +static void oa_tc6_phy_exit(struct oa_tc6 *tc6) +{ + phy_disconnect(tc6->phydev); + oa_tc6_mdiobus_unregister(tc6); +} + static int oa_tc6_read_status0(struct oa_tc6 *tc6) { u32 regval; @@ -354,11 +562,12 @@ static int oa_tc6_unmask_macphy_error_interrupts(struct oa_tc6 *tc6) /** * oa_tc6_init - allocates and initializes oa_tc6 structure. * @spi: device with which data will be exchanged. + * @netdev: network device interface structure. * * Return: pointer reference to the oa_tc6 structure if the MAC-PHY * initialization is successful otherwise NULL. */ -struct oa_tc6 *oa_tc6_init(struct spi_device *spi) +struct oa_tc6 *oa_tc6_init(struct spi_device *spi, struct net_device *netdev) { struct oa_tc6 *tc6; int ret; @@ -368,6 +577,8 @@ struct oa_tc6 *oa_tc6_init(struct spi_device *spi) return NULL; tc6->spi = spi; + tc6->netdev = netdev; + SET_NETDEV_DEV(netdev, &spi->dev); mutex_init(&tc6->spi_ctrl_lock); /* Set the SPI controller to pump at realtime priority */ @@ -400,10 +611,27 @@ struct oa_tc6 *oa_tc6_init(struct spi_device *spi) return NULL; } + ret = oa_tc6_phy_init(tc6); + if (ret) { + dev_err(&tc6->spi->dev, + "MAC internal PHY initialization failed: %d\n", ret); + return NULL; + } + return tc6; } EXPORT_SYMBOL_GPL(oa_tc6_init); +/** + * oa_tc6_exit - exit function. + * @tc6: oa_tc6 struct. 
+ */ +void oa_tc6_exit(struct oa_tc6 *tc6) +{ + oa_tc6_phy_exit(tc6); +} +EXPORT_SYMBOL_GPL(oa_tc6_exit); + MODULE_DESCRIPTION("OPEN Alliance 10BASE-T1x MAC-PHY Serial Interface Lib"); MODULE_AUTHOR("Parthiban Veerasooran "); MODULE_LICENSE("GPL"); diff --git a/include/linux/oa_tc6.h b/include/linux/oa_tc6.h index 85aeecf87306..606ba9f1e663 100644 --- a/include/linux/oa_tc6.h +++ b/include/linux/oa_tc6.h @@ -7,11 +7,13 @@ * Author: Parthiban Veerasooran */ +#include #include struct oa_tc6; -struct oa_tc6 *oa_tc6_init(struct spi_device *spi); +struct oa_tc6 *oa_tc6_init(struct spi_device *spi, struct net_device *netdev); +void oa_tc6_exit(struct oa_tc6 *tc6); int oa_tc6_write_register(struct oa_tc6 *tc6, u32 address, u32 value); int oa_tc6_write_registers(struct oa_tc6 *tc6, u32 address, u32 value[], u8 length); diff --git a/include/uapi/linux/mdio.h b/include/uapi/linux/mdio.h index c0c8ec995b06..f0d3f268240d 100644 --- a/include/uapi/linux/mdio.h +++ b/include/uapi/linux/mdio.h @@ -23,6 +23,7 @@ #define MDIO_MMD_DTEXS 5 /* DTE Extender Sublayer */ #define MDIO_MMD_TC 6 /* Transmission Convergence */ #define MDIO_MMD_AN 7 /* Auto-Negotiation */ +#define MDIO_MMD_POWER_UNIT 13 /* PHY Power Unit */ #define MDIO_MMD_C22EXT 29 /* Clause 22 extension */ #define MDIO_MMD_VEND1 30 /* Vendor specific 1 */ #define MDIO_MMD_VEND2 31 /* Vendor specific 2 */ -- cgit v1.2.3 From 7cc2a6eadcd7a5aa36ac63e6659f5c6138c7f4d2 Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Wed, 11 Sep 2024 13:56:08 -0600 Subject: io_uring: add IORING_REGISTER_COPY_BUFFERS method Buffers can get registered with io_uring, which allows skipping the repeated pin_pages, unpin/unref pages for each O_DIRECT operation. This reduces the overhead of O_DIRECT IO. However, registering buffers can take some time. Normally this isn't an issue as it's done at initialization time (and hence less critical), but for cases where rings can be created and destroyed as part of an IO thread pool, registering the same buffers for multiple rings becomes a more time-sensitive proposition. As an example, let's say an application has an IO memory pool of 500G. Initial registration takes: Got 500 huge pages (each 1024MB) Registered 500 pages in 409 msec or about 0.4 seconds. If we go higher to 900 1GB huge pages being registered: Registered 900 pages in 738 msec which is, as expected, a fully linear scaling. Rather than have each ring pin/map/register the same buffer pool, provide an io_uring_register(2) opcode to simply duplicate the buffers that are registered with another ring. Adding the same 900GB of registered buffers to the target ring can then be accomplished in: Copied 900 pages in 17 usec While timing differs a bit, this provides around a 25,000-40,000x speedup for this use case.
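Taking the uapi added below at face value, duplicating registrations into a freshly created ring is a single register call. A minimal sketch, using the raw syscall for clarity and eliding error handling:

#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Duplicate the buffer registrations of src_ring_fd into dst_ring_fd.
 * dst must not have buffers registered yet (the kernel returns -EBUSY),
 * and nr_args must be 1. */
static int copy_buffers(int dst_ring_fd, int src_ring_fd)
{
        struct io_uring_copy_buffers buf;

        memset(&buf, 0, sizeof(buf));   /* pad[] must be zero */
        buf.src_fd = src_ring_fd;
        buf.flags = 0;                  /* IORING_REGISTER_SRC_REGISTERED if
                                         * src_fd is a registered ring fd */

        return syscall(__NR_io_uring_register, dst_ring_fd,
                       IORING_REGISTER_COPY_BUFFERS, &buf, 1);
}

Note that a later patch in this series renames the opcode and struct to IORING_REGISTER_CLONE_BUFFERS / io_uring_clone_buffers without changing the ABI.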
Signed-off-by: Jens Axboe --- include/uapi/linux/io_uring.h | 13 +++++++ io_uring/register.c | 6 +++ io_uring/rsrc.c | 91 +++++++++++++++++++++++++++++++++++++++++++ io_uring/rsrc.h | 1 + 4 files changed, 111 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index a275f91d2ac0..9dc5bb428c8a 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -609,6 +609,9 @@ enum io_uring_register_op { IORING_REGISTER_CLOCK = 29, + /* copy registered buffers from source ring to current ring */ + IORING_REGISTER_COPY_BUFFERS = 30, + /* this goes last */ IORING_REGISTER_LAST, @@ -694,6 +697,16 @@ struct io_uring_clock_register { __u32 __resv[3]; }; +enum { + IORING_REGISTER_SRC_REGISTERED = 1, +}; + +struct io_uring_copy_buffers { + __u32 src_fd; + __u32 flags; + __u32 pad[6]; +}; + struct io_uring_buf { __u64 addr; __u32 len; diff --git a/io_uring/register.c b/io_uring/register.c index d90159478045..dab0f8024ddf 100644 --- a/io_uring/register.c +++ b/io_uring/register.c @@ -542,6 +542,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, break; ret = io_register_clock(ctx, arg); break; + case IORING_REGISTER_COPY_BUFFERS: + ret = -EINVAL; + if (!arg || nr_args != 1) + break; + ret = io_register_copy_buffers(ctx, arg); + break; default: ret = -EINVAL; break; diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c index 28f98de3c304..40696a395f0a 100644 --- a/io_uring/rsrc.c +++ b/io_uring/rsrc.c @@ -17,6 +17,7 @@ #include "openclose.h" #include "rsrc.h" #include "memmap.h" +#include "register.h" struct io_rsrc_update { struct file *file; @@ -1137,3 +1138,93 @@ int io_import_fixed(int ddir, struct iov_iter *iter, return 0; } + +static int io_copy_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx) +{ + struct io_mapped_ubuf **user_bufs; + struct io_rsrc_data *data; + int i, ret, nbufs; + + /* + * Drop our own lock here. We'll setup the data we need and reference + * the source buffers, then re-grab, check, and assign at the end. + */ + mutex_unlock(&ctx->uring_lock); + + mutex_lock(&src_ctx->uring_lock); + ret = -ENXIO; + nbufs = src_ctx->nr_user_bufs; + if (!nbufs) + goto out_unlock; + ret = io_rsrc_data_alloc(ctx, IORING_RSRC_BUFFER, NULL, nbufs, &data); + if (ret) + goto out_unlock; + + ret = -ENOMEM; + user_bufs = kcalloc(nbufs, sizeof(*ctx->user_bufs), GFP_KERNEL); + if (!user_bufs) + goto out_free_data; + + for (i = 0; i < nbufs; i++) { + struct io_mapped_ubuf *src = src_ctx->user_bufs[i]; + + refcount_inc(&src->refs); + user_bufs[i] = src; + } + + /* Have a ref on the bufs now, drop src lock and re-grab our own lock */ + mutex_unlock(&src_ctx->uring_lock); + mutex_lock(&ctx->uring_lock); + if (!ctx->user_bufs) { + ctx->user_bufs = user_bufs; + ctx->buf_data = data; + ctx->nr_user_bufs = nbufs; + return 0; + } + + /* someone raced setting up buffers, dump ours */ + for (i = 0; i < nbufs; i++) + io_buffer_unmap(ctx, &user_bufs[i]); + io_rsrc_data_free(data); + kfree(user_bufs); + return -EBUSY; +out_free_data: + io_rsrc_data_free(data); +out_unlock: + mutex_unlock(&src_ctx->uring_lock); + mutex_lock(&ctx->uring_lock); + return ret; +} + +/* + * Copy the registered buffers from the source ring whose file descriptor + * is given in the src_fd to the current ring. This is identical to registering + * the buffers with ctx, except faster as mappings already exist. + * + * Since the memory is already accounted once, don't account it again. 
+ */ +int io_register_copy_buffers(struct io_ring_ctx *ctx, void __user *arg) +{ + struct io_uring_copy_buffers buf; + bool registered_src; + struct file *file; + int ret; + + if (ctx->user_bufs || ctx->nr_user_bufs) + return -EBUSY; + if (copy_from_user(&buf, arg, sizeof(buf))) + return -EFAULT; + if (buf.flags & ~IORING_REGISTER_SRC_REGISTERED) + return -EINVAL; + if (memchr_inv(buf.pad, 0, sizeof(buf.pad))) + return -EINVAL; + + registered_src = (buf.flags & IORING_REGISTER_SRC_REGISTERED) != 0; + file = io_uring_register_get_file(buf.src_fd, registered_src); + if (IS_ERR(file)) + return PTR_ERR(file); + ret = io_copy_buffers(ctx, file->private_data); + if (!registered_src) + fput(file); + return ret; +} diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h index 98a253172c27..93546ab337a6 100644 --- a/io_uring/rsrc.h +++ b/io_uring/rsrc.h @@ -68,6 +68,7 @@ int io_import_fixed(int ddir, struct iov_iter *iter, struct io_mapped_ubuf *imu, u64 buf_addr, size_t len); +int io_register_copy_buffers(struct io_ring_ctx *ctx, void __user *arg); void __io_sqe_buffers_unregister(struct io_ring_ctx *ctx); int io_sqe_buffers_unregister(struct io_ring_ctx *ctx); int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg, -- cgit v1.2.3 From b2155807893aac40f1a1cdf43f7fcc270cbfc05a Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Tue, 10 Sep 2024 17:21:42 -0700 Subject: uapi: libc-compat: remove ipx leftovers The uAPI headers for IPX were deleted 3 years ago in commit 6c9b40844751 ("net: Remove net/ipx.h and uapi/linux/ipx.h header files") Delete the leftover defines from libc-compat.h Link: https://patch.msgid.link/20240911002142.1508694-1-kuba@kernel.org Signed-off-by: Jakub Kicinski --- include/uapi/linux/libc-compat.h | 36 ------------------------------------ 1 file changed, 36 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/libc-compat.h b/include/uapi/linux/libc-compat.h index 8254c937c9f4..0eca95ccb41e 100644 --- a/include/uapi/linux/libc-compat.h +++ b/include/uapi/linux/libc-compat.h @@ -140,25 +140,6 @@ #endif /* _NETINET_IN_H */ -/* Coordinate with glibc netipx/ipx.h header. 
*/ -#if defined(__NETIPX_IPX_H) - -#define __UAPI_DEF_SOCKADDR_IPX 0 -#define __UAPI_DEF_IPX_ROUTE_DEFINITION 0 -#define __UAPI_DEF_IPX_INTERFACE_DEFINITION 0 -#define __UAPI_DEF_IPX_CONFIG_DATA 0 -#define __UAPI_DEF_IPX_ROUTE_DEF 0 - -#else /* defined(__NETIPX_IPX_H) */ - -#define __UAPI_DEF_SOCKADDR_IPX 1 -#define __UAPI_DEF_IPX_ROUTE_DEFINITION 1 -#define __UAPI_DEF_IPX_INTERFACE_DEFINITION 1 -#define __UAPI_DEF_IPX_CONFIG_DATA 1 -#define __UAPI_DEF_IPX_ROUTE_DEF 1 - -#endif /* defined(__NETIPX_IPX_H) */ - /* Definitions for xattr.h */ #if defined(_SYS_XATTR_H) #define __UAPI_DEF_XATTR 0 @@ -240,23 +221,6 @@ #define __UAPI_DEF_IP6_MTUINFO 1 #endif -/* Definitions for ipx.h */ -#ifndef __UAPI_DEF_SOCKADDR_IPX -#define __UAPI_DEF_SOCKADDR_IPX 1 -#endif -#ifndef __UAPI_DEF_IPX_ROUTE_DEFINITION -#define __UAPI_DEF_IPX_ROUTE_DEFINITION 1 -#endif -#ifndef __UAPI_DEF_IPX_INTERFACE_DEFINITION -#define __UAPI_DEF_IPX_INTERFACE_DEFINITION 1 -#endif -#ifndef __UAPI_DEF_IPX_CONFIG_DATA -#define __UAPI_DEF_IPX_CONFIG_DATA 1 -#endif -#ifndef __UAPI_DEF_IPX_ROUTE_DEF -#define __UAPI_DEF_IPX_ROUTE_DEF 1 -#endif - /* Definitions for xattr.h */ #ifndef __UAPI_DEF_XATTR #define __UAPI_DEF_XATTR 1 -- cgit v1.2.3 From 9cbed5aab5aeea420d0aa945733bf608449d44fb Mon Sep 17 00:00:00 2001 From: Chiara Meiohas Date: Mon, 9 Sep 2024 20:30:24 +0300 Subject: RDMA/nldev: Add support for RDMA monitoring Introduce a new netlink command to allow rdma event monitoring. The rdma events supported now are IB device registration/unregistration and net device attachment/detachment. Example output of rdma monitor and the commands which trigger the events: $ rdma monitor $ rmmod mlx5_ib [UNREGISTER] dev 1 rocep8s0f1 [UNREGISTER] dev 0 rocep8s0f0 $ modprobe mlx5_ib [REGISTER] dev 2 mlx5_0 [NETDEV_ATTACH] dev 2 mlx5_0 port 1 netdev 4 eth2 [REGISTER] dev 3 mlx5_1 [NETDEV_ATTACH] dev 3 mlx5_1 port 1 netdev 5 eth3 $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev [UNREGISTER] dev 2 rocep8s0f0 [REGISTER] dev 4 mlx5_0 [NETDEV_ATTACH] dev 4 mlx5_0 port 30 netdev 4 eth2 $ echo 4 > /sys/class/net/eth2/device/sriov_numvfs [NETDEV_ATTACH] dev 4 rdmap8s0f0 port 2 netdev 7 eth4 [NETDEV_ATTACH] dev 4 rdmap8s0f0 port 3 netdev 8 eth5 [NETDEV_ATTACH] dev 4 rdmap8s0f0 port 4 netdev 9 eth6 [NETDEV_ATTACH] dev 4 rdmap8s0f0 port 5 netdev 10 eth7 [REGISTER] dev 5 mlx5_0 [NETDEV_ATTACH] dev 5 mlx5_0 port 1 netdev 11 eth8 [REGISTER] dev 6 mlx5_0 [NETDEV_ATTACH] dev 6 mlx5_0 port 1 netdev 12 eth9 [REGISTER] dev 7 mlx5_0 [NETDEV_ATTACH] dev 7 mlx5_0 port 1 netdev 13 eth10 [REGISTER] dev 8 mlx5_0 [NETDEV_ATTACH] dev 8 mlx5_0 port 1 netdev 14 eth11 $ echo 0 > /sys/class/net/eth2/device/sriov_numvfs [UNREGISTER] dev 5 rocep8s0f0v0 [UNREGISTER] dev 6 rocep8s0f0v1 [UNREGISTER] dev 7 rocep8s0f0v2 [UNREGISTER] dev 8 rocep8s0f0v3 [NETDEV_DETACH] dev 4 rdmap8s0f0 port 2 [NETDEV_DETACH] dev 4 rdmap8s0f0 port 3 [NETDEV_DETACH] dev 4 rdmap8s0f0 port 4 [NETDEV_DETACH] dev 4 rdmap8s0f0 port 5 Signed-off-by: Chiara Meiohas Signed-off-by: Michael Guralnik Link: https://patch.msgid.link/20240909173025.30422-7-michaelgur@nvidia.com Signed-off-by: Leon Romanovsky --- drivers/infiniband/core/device.c | 35 +++++++++++ drivers/infiniband/core/netlink.c | 1 + drivers/infiniband/core/nldev.c | 124 ++++++++++++++++++++++++++++++++++++++ include/rdma/rdma_netlink.h | 12 ++++ include/uapi/rdma/rdma_netlink.h | 15 +++++ 5 files changed, 187 insertions(+) (limited to 'include/uapi') diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index 
9e765c79a892..e029401b5680 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -1351,6 +1351,29 @@ static void prevent_dealloc_device(struct ib_device *ib_dev) { } +static void ib_device_notify_register(struct ib_device *device) +{ + struct net_device *netdev; + u32 port; + int ret; + + ret = rdma_nl_notify_event(device, 0, RDMA_REGISTER_EVENT); + if (ret) + return; + + rdma_for_each_port(device, port) { + netdev = ib_device_get_netdev(device, port); + if (!netdev) + continue; + + ret = rdma_nl_notify_event(device, port, + RDMA_NETDEV_ATTACH_EVENT); + dev_put(netdev); + if (ret) + return; + } +} + /** * ib_register_device - Register an IB device with IB core * @device: Device to register @@ -1449,6 +1472,8 @@ int ib_register_device(struct ib_device *device, const char *name, dev_set_uevent_suppress(&device->dev, false); /* Mark for userspace that device is ready */ kobject_uevent(&device->dev.kobj, KOBJ_ADD); + + ib_device_notify_register(device); ib_device_put(device); return 0; @@ -1491,6 +1516,7 @@ static void __ib_unregister_device(struct ib_device *ib_dev) goto out; disable_device(ib_dev); + rdma_nl_notify_event(ib_dev, 0, RDMA_UNREGISTER_EVENT); /* Expedite removing unregistered pointers from the hash table */ free_netdevs(ib_dev); @@ -2159,6 +2185,7 @@ static void add_ndev_hash(struct ib_port_data *pdata) int ib_device_set_netdev(struct ib_device *ib_dev, struct net_device *ndev, u32 port) { + enum rdma_nl_notify_event_type etype; struct net_device *old_ndev; struct ib_port_data *pdata; unsigned long flags; @@ -2190,6 +2217,14 @@ int ib_device_set_netdev(struct ib_device *ib_dev, struct net_device *ndev, spin_unlock_irqrestore(&pdata->netdev_lock, flags); add_ndev_hash(pdata); + + /* Make sure that the device is registered before we send events */ + if (xa_load(&devices, ib_dev->index) != ib_dev) + return 0; + + etype = ndev ? 
RDMA_NETDEV_ATTACH_EVENT : RDMA_NETDEV_DETACH_EVENT; + rdma_nl_notify_event(ib_dev, port, etype); + return 0; } EXPORT_SYMBOL(ib_device_set_netdev); diff --git a/drivers/infiniband/core/netlink.c b/drivers/infiniband/core/netlink.c index ae2db0c70788..def14c54b648 100644 --- a/drivers/infiniband/core/netlink.c +++ b/drivers/infiniband/core/netlink.c @@ -311,6 +311,7 @@ int rdma_nl_net_init(struct rdma_dev_net *rnet) struct net *net = read_pnet(&rnet->net); struct netlink_kernel_cfg cfg = { .input = rdma_nl_rcv, + .flags = NL_CFG_F_NONROOT_RECV, }; struct sock *nls; diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c index 4d4a1f90e484..70b3fa0469f2 100644 --- a/drivers/infiniband/core/nldev.c +++ b/drivers/infiniband/core/nldev.c @@ -170,6 +170,7 @@ static const struct nla_policy nldev_policy[RDMA_NLDEV_ATTR_MAX] = { [RDMA_NLDEV_ATTR_DEV_TYPE] = { .type = NLA_U8 }, [RDMA_NLDEV_ATTR_PARENT_NAME] = { .type = NLA_NUL_STRING }, [RDMA_NLDEV_ATTR_NAME_ASSIGN_TYPE] = { .type = NLA_U8 }, + [RDMA_NLDEV_ATTR_EVENT_TYPE] = { .type = NLA_U8 }, }; static int put_driver_name_print_type(struct sk_buff *msg, const char *name, @@ -2722,6 +2723,129 @@ static const struct rdma_nl_cbs nldev_cb_table[RDMA_NLDEV_NUM_OPS] = { }, }; +static int fill_mon_netdev_association(struct sk_buff *msg, + struct ib_device *device, u32 port, + const struct net *net) +{ + struct net_device *netdev = ib_device_get_netdev(device, port); + int ret = 0; + + if (netdev && !net_eq(dev_net(netdev), net)) + goto out; + + ret = nla_put_u32(msg, RDMA_NLDEV_ATTR_DEV_INDEX, device->index); + if (ret) + goto out; + + ret = nla_put_string(msg, RDMA_NLDEV_ATTR_DEV_NAME, + dev_name(&device->dev)); + if (ret) + goto out; + + ret = nla_put_u32(msg, RDMA_NLDEV_ATTR_PORT_INDEX, port); + if (ret) + goto out; + + if (netdev) { + ret = nla_put_u32(msg, + RDMA_NLDEV_ATTR_NDEV_INDEX, netdev->ifindex); + if (ret) + goto out; + + ret = nla_put_string(msg, + RDMA_NLDEV_ATTR_NDEV_NAME, netdev->name); + } + +out: + dev_put(netdev); + return ret; +} + +static void rdma_nl_notify_err_msg(struct ib_device *device, u32 port_num, + enum rdma_nl_notify_event_type type) +{ + struct net_device *netdev; + + switch (type) { + case RDMA_REGISTER_EVENT: + dev_warn_ratelimited(&device->dev, + "Failed to send RDMA monitor register device event\n"); + break; + case RDMA_UNREGISTER_EVENT: + dev_warn_ratelimited(&device->dev, + "Failed to send RDMA monitor unregister device event\n"); + break; + case RDMA_NETDEV_ATTACH_EVENT: + netdev = ib_device_get_netdev(device, port_num); + dev_warn_ratelimited(&device->dev, + "Failed to send RDMA monitor netdev attach event: port %d netdev %d\n", + port_num, netdev->ifindex); + dev_put(netdev); + break; + case RDMA_NETDEV_DETACH_EVENT: + dev_warn_ratelimited(&device->dev, + "Failed to send RDMA monitor netdev detach event: port %d\n", + port_num); + default: + break; + } +} + +int rdma_nl_notify_event(struct ib_device *device, u32 port_num, + enum rdma_nl_notify_event_type type) +{ + struct sk_buff *skb; + struct net *net; + int ret = 0; + void *nlh; + + net = read_pnet(&device->coredev.rdma_net); + if (!net) + return -EINVAL; + + skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); + if (!skb) + return -ENOMEM; + nlh = nlmsg_put(skb, 0, 0, + RDMA_NL_GET_TYPE(RDMA_NL_NLDEV, RDMA_NLDEV_CMD_MONITOR), + 0, 0); + + switch (type) { + case RDMA_REGISTER_EVENT: + case RDMA_UNREGISTER_EVENT: + ret = fill_nldev_handle(skb, device); + if (ret) + goto err_free; + break; + case RDMA_NETDEV_ATTACH_EVENT: + case 
RDMA_NETDEV_DETACH_EVENT: + ret = fill_mon_netdev_association(skb, device, + port_num, net); + if (ret) + goto err_free; + break; + default: + break; + } + + ret = nla_put_u8(skb, RDMA_NLDEV_ATTR_EVENT_TYPE, type); + if (ret) + goto err_free; + + nlmsg_end(skb, nlh); + ret = rdma_nl_multicast(net, skb, RDMA_NL_GROUP_NOTIFY, GFP_KERNEL); + if (ret && ret != -ESRCH) { + skb = NULL; /* skb is freed in the netlink send-op handling */ + goto err_free; + } + return 0; + +err_free: + rdma_nl_notify_err_msg(device, port_num, type); + nlmsg_free(skb); + return ret; +} + void __init nldev_init(void) { rdma_nl_register(RDMA_NL_NLDEV, nldev_cb_table); diff --git a/include/rdma/rdma_netlink.h b/include/rdma/rdma_netlink.h index c2a79aeee113..326deaf56d5d 100644 --- a/include/rdma/rdma_netlink.h +++ b/include/rdma/rdma_netlink.h @@ -6,6 +6,8 @@ #include #include +struct ib_device; + enum { RDMA_NLDEV_ATTR_EMPTY_STRING = 1, RDMA_NLDEV_ATTR_ENTRY_STRLEN = 16, @@ -110,6 +112,16 @@ int rdma_nl_multicast(struct net *net, struct sk_buff *skb, */ bool rdma_nl_chk_listeners(unsigned int group); +/** + * Prepare and send an event message + * @ib: the IB device which triggered the event + * @port_num: the port number which triggered the event - 0 if unused + * @type: the event type + * Returns 0 on success or a negative error code + */ +int rdma_nl_notify_event(struct ib_device *ib, u32 port_num, + enum rdma_nl_notify_event_type type); + struct rdma_link_ops { struct list_head list; const char *type; diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h index 2f37568f5556..5f9636d26050 100644 --- a/include/uapi/rdma/rdma_netlink.h +++ b/include/uapi/rdma/rdma_netlink.h @@ -15,6 +15,7 @@ enum { enum { RDMA_NL_GROUP_IWPM = 2, RDMA_NL_GROUP_LS, + RDMA_NL_GROUP_NOTIFY, RDMA_NL_NUM_GROUPS }; @@ -305,6 +306,8 @@ enum rdma_nldev_command { RDMA_NLDEV_CMD_DELDEV, + RDMA_NLDEV_CMD_MONITOR, + RDMA_NLDEV_NUM_OPS }; @@ -574,6 +577,8 @@ enum rdma_nldev_attr { RDMA_NLDEV_ATTR_NAME_ASSIGN_TYPE, /* u8 */ + RDMA_NLDEV_ATTR_EVENT_TYPE, /* u8 */ + /* * Always the end */ @@ -624,4 +629,14 @@ enum rdma_nl_name_assign_type { RDMA_NAME_ASSIGN_TYPE_USER = 1, /* Provided by user-space */ }; +/* + * Supported rdma monitoring event types. + */ +enum rdma_nl_notify_event_type { + RDMA_REGISTER_EVENT, + RDMA_UNREGISTER_EVENT, + RDMA_NETDEV_ATTACH_EVENT, + RDMA_NETDEV_DETACH_EVENT, +}; + #endif /* _UAPI_RDMA_NETLINK_H */ -- cgit v1.2.3 From 12fb1153c53bf9b53e299c9775b84fa7838640f7 Mon Sep 17 00:00:00 2001 From: Chiara Meiohas Date: Mon, 9 Sep 2024 20:30:25 +0300 Subject: RDMA/nldev: Expose whether RDMA monitoring is supported Extend the "rdma sys" command to display whether RDMA monitoring is supported. RDMA monitoring is not supported in mlx4 because it does not use the ib_device_set_netdev() API, which sends the RDMA events. 
Example output for kernel where monitoring is supported: $ rdma sys show netns shared privileged-qkey off monitor on copy-on-fork on Example output for kernel where monitoring is not supported: $ rdma sys show netns shared privileged-qkey off monitor off copy-on-fork on Signed-off-by: Chiara Meiohas Signed-off-by: Michael Guralnik Link: https://patch.msgid.link/20240909173025.30422-8-michaelgur@nvidia.com Signed-off-by: Leon Romanovsky --- drivers/infiniband/core/nldev.c | 6 ++++++ include/uapi/rdma/rdma_netlink.h | 1 + 2 files changed, 7 insertions(+) (limited to 'include/uapi') diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c index 70b3fa0469f2..10b1411ac53d 100644 --- a/drivers/infiniband/core/nldev.c +++ b/drivers/infiniband/core/nldev.c @@ -1952,6 +1952,12 @@ static int nldev_sys_get_doit(struct sk_buff *skb, struct nlmsghdr *nlh, nlmsg_free(msg); return err; } + + err = nla_put_u8(msg, RDMA_NLDEV_SYS_ATTR_MONITOR_MODE, 1); + if (err) { + nlmsg_free(msg); + return err; + } /* * Copy-on-fork is supported. * See commits: diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h index 5f9636d26050..39be09c0ffbb 100644 --- a/include/uapi/rdma/rdma_netlink.h +++ b/include/uapi/rdma/rdma_netlink.h @@ -579,6 +579,8 @@ enum rdma_nldev_attr { RDMA_NLDEV_ATTR_EVENT_TYPE, /* u8 */ + RDMA_NLDEV_SYS_ATTR_MONITOR_MODE, /* u8 */ /* * Always the end */ -- cgit v1.2.3 From c951a29f6ba52b86223eb00bbcff43142d59a901 Mon Sep 17 00:00:00 2001 From: Ido Schimmel Date: Wed, 11 Sep 2024 12:37:43 +0300 Subject: net: fib_rules: Add DSCP selector attribute The FIB rule TOS selector is implemented differently between IPv4 and IPv6. In IPv4 it is used to match on the three "Type of Service" bits specified in RFC 791, while in IPv6 it is used to match on the six DSCP bits specified in RFC 2474. Add a new FIB rule attribute to allow matching on DSCP. The attribute will be used to implement a 'dscp' selector in ip-rule with a consistent behavior between IPv4 and IPv6. For now, set the type of the attribute to 'NLA_REJECT' so that user space will not be able to configure it. This restriction will be lifted once both IPv4 and IPv6 support the new attribute.
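With the policy set to NLA_REJECT, the observable behavior is that any RTM_NEWRULE request carrying FRA_DSCP fails validation. A hypothetical probe illustrating this, not part of the patch, with error handling elided:

#include <linux/fib_rules.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        struct {
                struct nlmsghdr nlh;
                struct fib_rule_hdr frh;
                char attrs[32];
        } req;
        struct { struct nlmsghdr nlh; struct nlmsgerr err; } resp;
        struct rtattr *rta;
        unsigned char dscp = 10;
        int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

        memset(&req, 0, sizeof(req));
        req.nlh.nlmsg_len = NLMSG_LENGTH(sizeof(req.frh));
        req.nlh.nlmsg_type = RTM_NEWRULE;
        req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_ACK;
        req.frh.family = AF_INET;
        req.frh.action = FR_ACT_TO_TBL;
        req.frh.table = RT_TABLE_MAIN;

        /* append FRA_DSCP, which the policy currently rejects */
        rta = (struct rtattr *)((char *)&req + NLMSG_ALIGN(req.nlh.nlmsg_len));
        rta->rta_type = FRA_DSCP;
        rta->rta_len = RTA_LENGTH(sizeof(dscp));
        memcpy(RTA_DATA(rta), &dscp, sizeof(dscp));
        req.nlh.nlmsg_len = NLMSG_ALIGN(req.nlh.nlmsg_len) + rta->rta_len;

        send(fd, &req, req.nlh.nlmsg_len, 0);
        recv(fd, &resp, sizeof(resp), 0);
        printf("FRA_DSCP: %d (expect -EINVAL until the selector lands)\n",
               resp.err.error);
        close(fd);
        return 0;
}

Rejecting the attribute at validation time makes the failure explicit, instead of silently ignoring an attribute that older kernels never understood.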
Signed-off-by: Ido Schimmel Reviewed-by: Guillaume Nault Reviewed-by: David Ahern Link: https://patch.msgid.link/20240911093748.3662015-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski --- include/uapi/linux/fib_rules.h | 1 + net/core/fib_rules.c | 3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/fib_rules.h b/include/uapi/linux/fib_rules.h index 232df14e1287..a6924dd3aff1 100644 --- a/include/uapi/linux/fib_rules.h +++ b/include/uapi/linux/fib_rules.h @@ -67,6 +67,7 @@ enum { FRA_IP_PROTO, /* ip proto */ FRA_SPORT_RANGE, /* sport */ FRA_DPORT_RANGE, /* dport */ + FRA_DSCP, /* dscp */ __FRA_MAX }; diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c index 5a4eb744758c..df41c05f7234 100644 --- a/net/core/fib_rules.c +++ b/net/core/fib_rules.c @@ -766,7 +766,8 @@ static const struct nla_policy fib_rule_policy[FRA_MAX + 1] = { [FRA_PROTOCOL] = { .type = NLA_U8 }, [FRA_IP_PROTO] = { .type = NLA_U8 }, [FRA_SPORT_RANGE] = { .len = sizeof(struct fib_rule_port_range) }, - [FRA_DPORT_RANGE] = { .len = sizeof(struct fib_rule_port_range) } + [FRA_DPORT_RANGE] = { .len = sizeof(struct fib_rule_port_range) }, + [FRA_DSCP] = { .type = NLA_REJECT }, }; int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr *nlh, -- cgit v1.2.3 From 636119af94f2fbf3e4458be66a1bc740ba69ce6d Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Sat, 14 Sep 2024 08:51:15 -0600 Subject: io_uring: rename "copy buffers" to "clone buffers" A recent commit added support for copying registered buffers from one ring to another. But that term is a bit confusing, as no copying of buffer data is done here. What is being done is simply cloning the buffer registrations from one ring to another. Rename it while we still can, so that it's more descriptive. No functional changes in this patch. 
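For userspace consumers the rename is purely cosmetic: opcode 30 and the struct layout are untouched. A hypothetical configure-time shim (HAVE_IORING_CLONE_BUFFERS is an assumed build-system probe result, not a kernel define) keeps sources written against the new spellings building on pre-rename headers:

#include <linux/io_uring.h>

#ifndef HAVE_IORING_CLONE_BUFFERS
/* pre-rename headers only know the COPY spellings; same values, same ABI */
#define IORING_REGISTER_CLONE_BUFFERS   IORING_REGISTER_COPY_BUFFERS
#define io_uring_clone_buffers          io_uring_copy_buffers
#endif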
Fixes: 7cc2a6eadcd7 ("io_uring: add IORING_REGISTER_COPY_BUFFERS method") Signed-off-by: Jens Axboe --- include/uapi/linux/io_uring.h | 6 +++--- io_uring/register.c | 4 ++-- io_uring/rsrc.c | 8 ++++---- io_uring/rsrc.h | 2 +- 4 files changed, 10 insertions(+), 10 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 9dc5bb428c8a..1fe79e750470 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -609,8 +609,8 @@ enum io_uring_register_op { IORING_REGISTER_CLOCK = 29, - /* copy registered buffers from source ring to current ring */ - IORING_REGISTER_COPY_BUFFERS = 30, + /* clone registered buffers from source ring to current ring */ + IORING_REGISTER_CLONE_BUFFERS = 30, /* this goes last */ IORING_REGISTER_LAST, @@ -701,7 +701,7 @@ enum { IORING_REGISTER_SRC_REGISTERED = 1, }; -struct io_uring_copy_buffers { +struct io_uring_clone_buffers { __u32 src_fd; __u32 flags; __u32 pad[6]; diff --git a/io_uring/register.c b/io_uring/register.c index dab0f8024ddf..b8a48a6a89ee 100644 --- a/io_uring/register.c +++ b/io_uring/register.c @@ -542,11 +542,11 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, break; ret = io_register_clock(ctx, arg); break; - case IORING_REGISTER_COPY_BUFFERS: + case IORING_REGISTER_CLONE_BUFFERS: ret = -EINVAL; if (!arg || nr_args != 1) break; - ret = io_register_copy_buffers(ctx, arg); + ret = io_register_clone_buffers(ctx, arg); break; default: ret = -EINVAL; diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c index 40696a395f0a..9264e555ae59 100644 --- a/io_uring/rsrc.c +++ b/io_uring/rsrc.c @@ -1139,7 +1139,7 @@ int io_import_fixed(int ddir, struct iov_iter *iter, return 0; } -static int io_copy_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx) +static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx) { struct io_mapped_ubuf **user_bufs; struct io_rsrc_data *data; @@ -1203,9 +1203,9 @@ out_unlock: * * Since the memory is already accounted once, don't account it again. 
*/ -int io_register_copy_buffers(struct io_ring_ctx *ctx, void __user *arg) +int io_register_clone_buffers(struct io_ring_ctx *ctx, void __user *arg) { - struct io_uring_copy_buffers buf; + struct io_uring_clone_buffers buf; bool registered_src; struct file *file; int ret; @@ -1223,7 +1223,7 @@ int io_register_copy_buffers(struct io_ring_ctx *ctx, void __user *arg) file = io_uring_register_get_file(buf.src_fd, registered_src); if (IS_ERR(file)) return PTR_ERR(file); - ret = io_copy_buffers(ctx, file->private_data); + ret = io_clone_buffers(ctx, file->private_data); if (!registered_src) fput(file); return ret; diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h index 93546ab337a6..eb4803e473b0 100644 --- a/io_uring/rsrc.h +++ b/io_uring/rsrc.h @@ -68,7 +68,7 @@ int io_import_fixed(int ddir, struct iov_iter *iter, struct io_mapped_ubuf *imu, u64 buf_addr, size_t len); -int io_register_copy_buffers(struct io_ring_ctx *ctx, void __user *arg); +int io_register_clone_buffers(struct io_ring_ctx *ctx, void __user *arg); void __io_sqe_buffers_unregister(struct io_ring_ctx *ctx); int io_sqe_buffers_unregister(struct io_ring_ctx *ctx); int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg, -- cgit v1.2.3 From 21d52e295ad2afc76bbd105da82a003b96f6ac77 Mon Sep 17 00:00:00 2001 From: Tahera Fahimi Date: Wed, 4 Sep 2024 18:13:55 -0600 Subject: landlock: Add abstract UNIX socket scoping MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Introduce a new "scoped" member to landlock_ruleset_attr that can specify LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET to restrict connection to abstract UNIX sockets from a process outside of the socket's domain. Two hooks are implemented to enforce these restrictions: unix_stream_connect and unix_may_send. Closes: https://github.com/landlock-lsm/linux/issues/7 Signed-off-by: Tahera Fahimi Link: https://lore.kernel.org/r/5f7ad85243b78427242275b93481cfc7c127764b.1725494372.git.fahimitahera@gmail.com [mic: Fix commit message formatting, improve documentation, simplify hook_unix_may_send(), and cosmetic fixes including rename of LANDLOCK_SCOPED_ABSTRACT_UNIX_SOCKET] Co-developed-by: Mickaël Salaün Signed-off-by: Mickaël Salaün --- include/uapi/linux/landlock.h | 27 ++++++ security/landlock/limits.h | 3 + security/landlock/ruleset.c | 7 +- security/landlock/ruleset.h | 24 ++++- security/landlock/syscalls.c | 17 +++- security/landlock/task.c | 137 +++++++++++++++++++++++++++ tools/testing/selftests/landlock/base_test.c | 2 +- 7 files changed, 208 insertions(+), 9 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h index 2c8dbc74b955..70edd17bafdc 100644 --- a/include/uapi/linux/landlock.h +++ b/include/uapi/linux/landlock.h @@ -44,6 +44,12 @@ struct landlock_ruleset_attr { * flags`_). */ __u64 handled_access_net; + /** + * @scoped: Bitmask of scopes (cf. `Scope flags`_) + * restricting a Landlock domain from accessing outside + * resources (e.g. IPCs). + */ + __u64 scoped; }; /* @@ -274,4 +280,25 @@ struct landlock_net_port_attr { #define LANDLOCK_ACCESS_NET_BIND_TCP (1ULL << 0) #define LANDLOCK_ACCESS_NET_CONNECT_TCP (1ULL << 1) /* clang-format on */ + +/** + * DOC: scope + * + * Scope flags + * ~~~~~~~~~~~ + * + * These flags enable to isolate a sandboxed process from a set of IPC actions. + * Setting a flag for a ruleset will isolate the Landlock domain to forbid + * connections to resources outside the domain. 
+ * + * Scopes: + * + * - %LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET: Restrict a sandboxed process from + * connecting to an abstract UNIX socket created by a process outside the + * related Landlock domain (e.g. a parent domain or a non-sandboxed process). + */ +/* clang-format off */ +#define LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET (1ULL << 0) +/* clang-format on*/ + #endif /* _UAPI_LINUX_LANDLOCK_H */ diff --git a/security/landlock/limits.h b/security/landlock/limits.h index 4eb643077a2a..d74818003ed4 100644 --- a/security/landlock/limits.h +++ b/security/landlock/limits.h @@ -26,6 +26,9 @@ #define LANDLOCK_MASK_ACCESS_NET ((LANDLOCK_LAST_ACCESS_NET << 1) - 1) #define LANDLOCK_NUM_ACCESS_NET __const_hweight64(LANDLOCK_MASK_ACCESS_NET) +#define LANDLOCK_LAST_SCOPE LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET +#define LANDLOCK_MASK_SCOPE ((LANDLOCK_LAST_SCOPE << 1) - 1) +#define LANDLOCK_NUM_SCOPE __const_hweight64(LANDLOCK_MASK_SCOPE) /* clang-format on */ #endif /* _SECURITY_LANDLOCK_LIMITS_H */ diff --git a/security/landlock/ruleset.c b/security/landlock/ruleset.c index 6ff232f58618..a93bdbf52fff 100644 --- a/security/landlock/ruleset.c +++ b/security/landlock/ruleset.c @@ -52,12 +52,13 @@ static struct landlock_ruleset *create_ruleset(const u32 num_layers) struct landlock_ruleset * landlock_create_ruleset(const access_mask_t fs_access_mask, - const access_mask_t net_access_mask) + const access_mask_t net_access_mask, + const access_mask_t scope_mask) { struct landlock_ruleset *new_ruleset; /* Informs about useless ruleset. */ - if (!fs_access_mask && !net_access_mask) + if (!fs_access_mask && !net_access_mask && !scope_mask) return ERR_PTR(-ENOMSG); new_ruleset = create_ruleset(1); if (IS_ERR(new_ruleset)) @@ -66,6 +67,8 @@ landlock_create_ruleset(const access_mask_t fs_access_mask, landlock_add_fs_access_mask(new_ruleset, fs_access_mask, 0); if (net_access_mask) landlock_add_net_access_mask(new_ruleset, net_access_mask, 0); + if (scope_mask) + landlock_add_scope_mask(new_ruleset, scope_mask, 0); return new_ruleset; } diff --git a/security/landlock/ruleset.h b/security/landlock/ruleset.h index 0f1b5b4c8f6b..61bdbc550172 100644 --- a/security/landlock/ruleset.h +++ b/security/landlock/ruleset.h @@ -35,6 +35,8 @@ typedef u16 access_mask_t; static_assert(BITS_PER_TYPE(access_mask_t) >= LANDLOCK_NUM_ACCESS_FS); /* Makes sure all network access rights can be stored. */ static_assert(BITS_PER_TYPE(access_mask_t) >= LANDLOCK_NUM_ACCESS_NET); +/* Makes sure all scoped rights can be stored. */ +static_assert(BITS_PER_TYPE(access_mask_t) >= LANDLOCK_NUM_SCOPE); /* Makes sure for_each_set_bit() and for_each_clear_bit() calls are OK. 
*/ static_assert(sizeof(unsigned long) >= sizeof(access_mask_t)); @@ -42,6 +44,7 @@ static_assert(sizeof(unsigned long) >= sizeof(access_mask_t)); struct access_masks { access_mask_t fs : LANDLOCK_NUM_ACCESS_FS; access_mask_t net : LANDLOCK_NUM_ACCESS_NET; + access_mask_t scope : LANDLOCK_NUM_SCOPE; }; typedef u16 layer_mask_t; @@ -233,7 +236,8 @@ struct landlock_ruleset { struct landlock_ruleset * landlock_create_ruleset(const access_mask_t access_mask_fs, - const access_mask_t access_mask_net); + const access_mask_t access_mask_net, + const access_mask_t scope_mask); void landlock_put_ruleset(struct landlock_ruleset *const ruleset); void landlock_put_ruleset_deferred(struct landlock_ruleset *const ruleset); @@ -280,6 +284,17 @@ landlock_add_net_access_mask(struct landlock_ruleset *const ruleset, ruleset->access_masks[layer_level].net |= net_mask; } +static inline void +landlock_add_scope_mask(struct landlock_ruleset *const ruleset, + const access_mask_t scope_mask, const u16 layer_level) +{ + access_mask_t mask = scope_mask & LANDLOCK_MASK_SCOPE; + + /* Should already be checked in sys_landlock_create_ruleset(). */ + WARN_ON_ONCE(scope_mask != mask); + ruleset->access_masks[layer_level].scope |= mask; +} + static inline access_mask_t landlock_get_raw_fs_access_mask(const struct landlock_ruleset *const ruleset, const u16 layer_level) @@ -303,6 +318,13 @@ landlock_get_net_access_mask(const struct landlock_ruleset *const ruleset, return ruleset->access_masks[layer_level].net; } +static inline access_mask_t +landlock_get_scope_mask(const struct landlock_ruleset *const ruleset, + const u16 layer_level) +{ + return ruleset->access_masks[layer_level].scope; +} + bool landlock_unmask_layers(const struct landlock_rule *const rule, const access_mask_t access_request, layer_mask_t (*const layer_masks)[], diff --git a/security/landlock/syscalls.c b/security/landlock/syscalls.c index ccc8bc6c1584..c67836841e46 100644 --- a/security/landlock/syscalls.c +++ b/security/landlock/syscalls.c @@ -97,8 +97,9 @@ static void build_check_abi(void) */ ruleset_size = sizeof(ruleset_attr.handled_access_fs); ruleset_size += sizeof(ruleset_attr.handled_access_net); + ruleset_size += sizeof(ruleset_attr.scoped); BUILD_BUG_ON(sizeof(ruleset_attr) != ruleset_size); - BUILD_BUG_ON(sizeof(ruleset_attr) != 16); + BUILD_BUG_ON(sizeof(ruleset_attr) != 24); path_beneath_size = sizeof(path_beneath_attr.allowed_access); path_beneath_size += sizeof(path_beneath_attr.parent_fd); @@ -149,7 +150,7 @@ static const struct file_operations ruleset_fops = { .write = fop_dummy_write, }; -#define LANDLOCK_ABI_VERSION 5 +#define LANDLOCK_ABI_VERSION 6 /** * sys_landlock_create_ruleset - Create a new ruleset @@ -170,8 +171,9 @@ static const struct file_operations ruleset_fops = { * Possible returned errors are: * * - %EOPNOTSUPP: Landlock is supported by the kernel but disabled at boot time; - * - %EINVAL: unknown @flags, or unknown access, or too small @size; - * - %E2BIG or %EFAULT: @attr or @size inconsistencies; + * - %EINVAL: unknown @flags, or unknown access, or unknown scope, or too small @size; + * - %E2BIG: @attr or @size inconsistencies; + * - %EFAULT: @attr or @size inconsistencies; * - %ENOMSG: empty &landlock_ruleset_attr.handled_access_fs. */ SYSCALL_DEFINE3(landlock_create_ruleset, @@ -213,9 +215,14 @@ SYSCALL_DEFINE3(landlock_create_ruleset, LANDLOCK_MASK_ACCESS_NET) return -EINVAL; + /* Checks IPC scoping content (and 32-bits cast). 
*/ + if ((ruleset_attr.scoped | LANDLOCK_MASK_SCOPE) != LANDLOCK_MASK_SCOPE) + return -EINVAL; + /* Checks arguments and transforms to kernel struct. */ ruleset = landlock_create_ruleset(ruleset_attr.handled_access_fs, - ruleset_attr.handled_access_net); + ruleset_attr.handled_access_net, + ruleset_attr.scoped); if (IS_ERR(ruleset)) return PTR_ERR(ruleset); diff --git a/security/landlock/task.c b/security/landlock/task.c index 849f5123610b..4f8013ca412e 100644 --- a/security/landlock/task.c +++ b/security/landlock/task.c @@ -13,6 +13,8 @@ #include #include #include +#include +#include #include "common.h" #include "cred.h" @@ -108,9 +110,144 @@ static int hook_ptrace_traceme(struct task_struct *const parent) return task_ptrace(parent, current); } +/** + * domain_is_scoped - Checks if the client domain is scoped in the same + * domain as the server. + * + * @client: IPC sender domain. + * @server: IPC receiver domain. + * @scope: The scope restriction criteria. + * + * Returns: True if the @client domain is scoped to access the @server, + * unless the @server is also scoped in the same domain as @client. + */ +static bool domain_is_scoped(const struct landlock_ruleset *const client, + const struct landlock_ruleset *const server, + access_mask_t scope) +{ + int client_layer, server_layer; + struct landlock_hierarchy *client_walker, *server_walker; + + /* Quick return if client has no domain */ + if (WARN_ON_ONCE(!client)) + return false; + + client_layer = client->num_layers - 1; + client_walker = client->hierarchy; + /* + * client_layer must be a signed integer with greater capacity + * than client->num_layers to ensure the following loop stops. + */ + BUILD_BUG_ON(sizeof(client_layer) > sizeof(client->num_layers)); + + server_layer = server ? (server->num_layers - 1) : -1; + server_walker = server ? server->hierarchy : NULL; + + /* + * Walks client's parent domains down to the same hierarchy level + * as the server's domain, and checks that none of these client's + * parent domains are scoped. + */ + for (; client_layer > server_layer; client_layer--) { + if (landlock_get_scope_mask(client, client_layer) & scope) + return true; + + client_walker = client_walker->parent; + } + /* + * Walks server's parent domains down to the same hierarchy level as + * the client's domain. + */ + for (; server_layer > client_layer; server_layer--) + server_walker = server_walker->parent; + + for (; client_layer >= 0; client_layer--) { + if (landlock_get_scope_mask(client, client_layer) & scope) { + /* + * Client and server are at the same level in the + * hierarchy. If the client is scoped, the request is + * only allowed if this domain is also a server's + * ancestor. + */ + return server_walker != client_walker; + } + client_walker = client_walker->parent; + server_walker = server_walker->parent; + } + return false; +} + +static bool sock_is_scoped(struct sock *const other, + const struct landlock_ruleset *const domain) +{ + const struct landlock_ruleset *dom_other; + + /* The credentials will not change. 
*/ + lockdep_assert_held(&unix_sk(other)->lock); + dom_other = landlock_cred(other->sk_socket->file->f_cred)->domain; + return domain_is_scoped(domain, dom_other, + LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET); +} + +static bool is_abstract_socket(struct sock *const sock) +{ + struct unix_address *addr = unix_sk(sock)->addr; + + if (!addr) + return false; + + if (addr->len >= offsetof(struct sockaddr_un, sun_path) + 1 && + addr->name->sun_path[0] == '\0') + return true; + + return false; +} + +static int hook_unix_stream_connect(struct sock *const sock, + struct sock *const other, + struct sock *const newsk) +{ + const struct landlock_ruleset *const dom = + landlock_get_current_domain(); + + /* Quick return for non-landlocked tasks. */ + if (!dom) + return 0; + + if (is_abstract_socket(other) && sock_is_scoped(other, dom)) + return -EPERM; + + return 0; +} + +static int hook_unix_may_send(struct socket *const sock, + struct socket *const other) +{ + const struct landlock_ruleset *const dom = + landlock_get_current_domain(); + + if (!dom) + return 0; + + /* + * Checks if this datagram socket was already allowed to be connected + * to other. + */ + if (unix_peer(sock->sk) == other->sk) + return 0; + + if (is_abstract_socket(other->sk) && sock_is_scoped(other->sk, dom)) + return -EPERM; + + return 0; +} + static struct security_hook_list landlock_hooks[] __ro_after_init = { LSM_HOOK_INIT(ptrace_access_check, hook_ptrace_access_check), LSM_HOOK_INIT(ptrace_traceme, hook_ptrace_traceme), + + LSM_HOOK_INIT(unix_stream_connect, hook_unix_stream_connect), + LSM_HOOK_INIT(unix_may_send, hook_unix_may_send), }; __init void landlock_add_task_hooks(void) diff --git a/tools/testing/selftests/landlock/base_test.c b/tools/testing/selftests/landlock/base_test.c index 3b26bf3cf5b9..1bc16fde2e8a 100644 --- a/tools/testing/selftests/landlock/base_test.c +++ b/tools/testing/selftests/landlock/base_test.c @@ -76,7 +76,7 @@ TEST(abi_version) const struct landlock_ruleset_attr ruleset_attr = { .handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE, }; - ASSERT_EQ(5, landlock_create_ruleset(NULL, 0, + ASSERT_EQ(6, landlock_create_ruleset(NULL, 0, LANDLOCK_CREATE_RULESET_VERSION)); ASSERT_EQ(-1, landlock_create_ruleset(&ruleset_attr, 0, -- cgit v1.2.3 From 54a6e6bbf3bef25c8eb65619edde70af49bd3db0 Mon Sep 17 00:00:00 2001 From: Tahera Fahimi Date: Fri, 6 Sep 2024 15:30:03 -0600 Subject: landlock: Add signal scoping MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Currently, a sandboxed process is not restricted from sending a signal (e.g. SIGKILL) to a process outside the sandbox environment. A sandboxed process's ability to send signals should be scoped the same way abstract UNIX sockets are scoped. Therefore, we extend the "scoped" field in a ruleset with LANDLOCK_SCOPE_SIGNAL to specify that a ruleset will deny sending any signal from within a sandboxed process to a process outside its domain (i.e. in any parent sandbox, or not sandboxed at all). This patch adds file_set_fowner and file_free_security hooks to set and release a pointer to the file owner's domain. This pointer, fown_domain in struct landlock_file_security, will be used in file_send_sigiotask to check whether the process can send a signal. The ruleset_with_unknown_scope test is updated to support LANDLOCK_SCOPE_SIGNAL. This depends on two new changes: - commit 1934b212615d ("file: reclaim 24 bytes from f_owner"): replace container_of(fown, struct file, f_owner) with fown->file.
- commit 26f204380a3c ("fs: Fix file_set_fowner LSM hook inconsistencies"): lock before calling the hook. Signed-off-by: Tahera Fahimi Closes: https://github.com/landlock-lsm/linux/issues/8 Link: https://lore.kernel.org/r/df2b4f880a2ed3042992689a793ea0951f6798a5.1725657727.git.fahimitahera@gmail.com [mic: Update landlock_get_current_domain()'s return type, improve and fix locking in hook_file_set_fowner(), simplify and fix sleepable call and locking issue in hook_file_send_sigiotask() and rebase on the latest VFS tree, simplify hook_task_kill() and quickly return when not sandboxed, improve comments, rename LANDLOCK_SCOPED_SIGNAL] Co-developed-by: Mickaël Salaün Signed-off-by: Mickaël Salaün --- include/uapi/linux/landlock.h | 3 ++ security/landlock/cred.h | 2 +- security/landlock/fs.c | 25 ++++++++++++ security/landlock/fs.h | 7 ++++ security/landlock/limits.h | 2 +- security/landlock/task.c | 56 ++++++++++++++++++++++++++ tools/testing/selftests/landlock/scoped_test.c | 2 +- 7 files changed, 94 insertions(+), 3 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h index 70edd17bafdc..33745642f787 100644 --- a/include/uapi/linux/landlock.h +++ b/include/uapi/linux/landlock.h @@ -296,9 +296,12 @@ struct landlock_net_port_attr { * - %LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET: Restrict a sandboxed process from * connecting to an abstract UNIX socket created by a process outside the * related Landlock domain (e.g. a parent domain or a non-sandboxed process). + * - %LANDLOCK_SCOPE_SIGNAL: Restrict a sandboxed process from sending a signal + * to another process outside the domain. */ /* clang-format off */ #define LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET (1ULL << 0) +#define LANDLOCK_SCOPE_SIGNAL (1ULL << 1) /* clang-format on*/ #endif /* _UAPI_LINUX_LANDLOCK_H */ diff --git a/security/landlock/cred.h b/security/landlock/cred.h index af89ab00e6d1..bf755459838a 100644 --- a/security/landlock/cred.h +++ b/security/landlock/cred.h @@ -26,7 +26,7 @@ landlock_cred(const struct cred *cred) return cred->security + landlock_blob_sizes.lbs_cred; } -static inline const struct landlock_ruleset *landlock_get_current_domain(void) +static inline struct landlock_ruleset *landlock_get_current_domain(void) { return landlock_cred(current_cred())->domain; } diff --git a/security/landlock/fs.c b/security/landlock/fs.c index 0804f76a67be..7d79fc8abe21 100644 --- a/security/landlock/fs.c +++ b/security/landlock/fs.c @@ -1639,6 +1639,29 @@ static int hook_file_ioctl_compat(struct file *file, unsigned int cmd, return -EACCES; } +static void hook_file_set_fowner(struct file *file) +{ + struct landlock_ruleset *new_dom, *prev_dom; + + /* + * Lock already held by __f_setown(), see commit 26f204380a3c ("fs: Fix + * file_set_fowner LSM hook inconsistencies"). + */ + lockdep_assert_held(&file_f_owner(file)->lock); + new_dom = landlock_get_current_domain(); + landlock_get_ruleset(new_dom); + prev_dom = landlock_file(file)->fown_domain; + landlock_file(file)->fown_domain = new_dom; + + /* Called in an RCU read-side critical section. 
*/ + landlock_put_ruleset_deferred(prev_dom); +} + +static void hook_file_free_security(struct file *file) +{ + landlock_put_ruleset_deferred(landlock_file(file)->fown_domain); +} + static struct security_hook_list landlock_hooks[] __ro_after_init = { LSM_HOOK_INIT(inode_free_security_rcu, hook_inode_free_security_rcu), @@ -1663,6 +1686,8 @@ static struct security_hook_list landlock_hooks[] __ro_after_init = { LSM_HOOK_INIT(file_truncate, hook_file_truncate), LSM_HOOK_INIT(file_ioctl, hook_file_ioctl), LSM_HOOK_INIT(file_ioctl_compat, hook_file_ioctl_compat), + LSM_HOOK_INIT(file_set_fowner, hook_file_set_fowner), + LSM_HOOK_INIT(file_free_security, hook_file_free_security), }; __init void landlock_add_fs_hooks(void) diff --git a/security/landlock/fs.h b/security/landlock/fs.h index 488e4813680a..1487e1f023a1 100644 --- a/security/landlock/fs.h +++ b/security/landlock/fs.h @@ -52,6 +52,13 @@ struct landlock_file_security { * needed to authorize later operations on the open file. */ access_mask_t allowed_access; + /** + * @fown_domain: Domain of the task that set the PID that may receive a + * signal e.g., SIGURG when writing MSG_OOB to the related socket. + * This pointer is protected by the related file->f_owner->lock, as for + * fown_struct's members: pid, uid, and euid. + */ + struct landlock_ruleset *fown_domain; }; /** diff --git a/security/landlock/limits.h b/security/landlock/limits.h index d74818003ed4..15f7606066c8 100644 --- a/security/landlock/limits.h +++ b/security/landlock/limits.h @@ -26,7 +26,7 @@ #define LANDLOCK_MASK_ACCESS_NET ((LANDLOCK_LAST_ACCESS_NET << 1) - 1) #define LANDLOCK_NUM_ACCESS_NET __const_hweight64(LANDLOCK_MASK_ACCESS_NET) -#define LANDLOCK_LAST_SCOPE LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET +#define LANDLOCK_LAST_SCOPE LANDLOCK_SCOPE_SIGNAL #define LANDLOCK_MASK_SCOPE ((LANDLOCK_LAST_SCOPE << 1) - 1) #define LANDLOCK_NUM_SCOPE __const_hweight64(LANDLOCK_MASK_SCOPE) /* clang-format on */ diff --git a/security/landlock/task.c b/security/landlock/task.c index 4f8013ca412e..4acbd7c40eee 100644 --- a/security/landlock/task.c +++ b/security/landlock/task.c @@ -18,6 +18,7 @@ #include "common.h" #include "cred.h" +#include "fs.h" #include "ruleset.h" #include "setup.h" #include "task.h" @@ -242,12 +243,67 @@ static int hook_unix_may_send(struct socket *const sock, return 0; } +static int hook_task_kill(struct task_struct *const p, + struct kernel_siginfo *const info, const int sig, + const struct cred *const cred) +{ + bool is_scoped; + const struct landlock_ruleset *dom; + + if (cred) { + /* Dealing with USB IO. */ + dom = landlock_cred(cred)->domain; + } else { + dom = landlock_get_current_domain(); + } + + /* Quick return for non-landlocked tasks. */ + if (!dom) + return 0; + + rcu_read_lock(); + is_scoped = domain_is_scoped(dom, landlock_get_task_domain(p), + LANDLOCK_SCOPE_SIGNAL); + rcu_read_unlock(); + if (is_scoped) + return -EPERM; + + return 0; +} + +static int hook_file_send_sigiotask(struct task_struct *tsk, + struct fown_struct *fown, int signum) +{ + const struct landlock_ruleset *dom; + bool is_scoped = false; + + /* Lock already held by send_sigio() and send_sigurg(). */ + lockdep_assert_held(&fown->lock); + dom = landlock_file(fown->file)->fown_domain; + + /* Quick return for unowned socket. 
*/ + if (!dom) + return 0; + + rcu_read_lock(); + is_scoped = domain_is_scoped(dom, landlock_get_task_domain(tsk), + LANDLOCK_SCOPE_SIGNAL); + rcu_read_unlock(); + if (is_scoped) + return -EPERM; + + return 0; +} + static struct security_hook_list landlock_hooks[] __ro_after_init = { LSM_HOOK_INIT(ptrace_access_check, hook_ptrace_access_check), LSM_HOOK_INIT(ptrace_traceme, hook_ptrace_traceme), LSM_HOOK_INIT(unix_stream_connect, hook_unix_stream_connect), LSM_HOOK_INIT(unix_may_send, hook_unix_may_send), + + LSM_HOOK_INIT(task_kill, hook_task_kill), + LSM_HOOK_INIT(file_send_sigiotask, hook_file_send_sigiotask), }; __init void landlock_add_task_hooks(void) diff --git a/tools/testing/selftests/landlock/scoped_test.c b/tools/testing/selftests/landlock/scoped_test.c index 2d15f09d63e2..b90f76ed0d9c 100644 --- a/tools/testing/selftests/landlock/scoped_test.c +++ b/tools/testing/selftests/landlock/scoped_test.c @@ -12,7 +12,7 @@ #include "common.h" -#define ACCESS_LAST LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET +#define ACCESS_LAST LANDLOCK_SCOPE_SIGNAL TEST(ruleset_with_unknown_scope) { -- cgit v1.2.3 From f761fcdd289d07e8547fef7ac76c3760fc7803f2 Mon Sep 17 00:00:00 2001 From: Dongliang Cui Date: Wed, 18 Sep 2024 07:40:05 +0900 Subject: exfat: Implement sops->shutdown and ioctl We found that when writing a large file through buffered writes, if the disk becomes inaccessible, exFAT does not return an error, so the writing process does not stop properly. To easily reproduce this issue, you can follow the steps below: 1. format a device as exFAT and then mount it (with a full disk erase) 2. dd if=/dev/zero of=/exfat_mount/test.img bs=1M count=8192 3. eject the device You may find that the dd process does not stop immediately and may continue for a long time. The root cause is that during buffered writes, exFAT does not need to access the disk to look up directory entries or the FAT table (whereas FAT would) every time data is written. Instead, exFAT simply marks the buffer as dirty and returns, delegating writeback to the writeback process. If the disk cannot be accessed at that point, the error is only reported to the writeback process; the original process never receives it, so it cannot be propagated to userspace. When the disk cannot be accessed, an error should be returned to stop the writing process. Implement sops->shutdown and an ioctl to shut down the file system when the underlying block device is marked dead.
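As a rough illustration of the new interface (a hedged sketch, not part of this patch): the following userspace program triggers a full-sync shutdown through EXFAT_IOC_SHUTDOWN. The mount point /exfat_mount matches the reproduction steps above; the ioctl and flag definitions are repeated locally in case the <linux/exfat.h> header added below is not yet installed.

/*
 * Sketch: force-shutdown a mounted exFAT volume via EXFAT_IOC_SHUTDOWN.
 * The fallback definitions mirror include/uapi/linux/exfat.h from this
 * patch so the sample builds even without the installed header.
 */
#include <fcntl.h>
#include <linux/ioctl.h>
#include <linux/types.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

#ifndef EXFAT_IOC_SHUTDOWN
#define EXFAT_IOC_SHUTDOWN		_IOR('X', 125, __u32)
#define EXFAT_GOING_DOWN_DEFAULT	0x0	/* default with full sync */
#define EXFAT_GOING_DOWN_FULLSYNC	0x1	/* going down with full sync */
#define EXFAT_GOING_DOWN_NOSYNC		0x2	/* going down */
#endif

int main(void)
{
	__u32 flags = EXFAT_GOING_DOWN_FULLSYNC;
	/* Assumed mount point, as in the reproduction steps above. */
	int fd = open("/exfat_mount", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Requires CAP_SYS_ADMIN; the kernel reads the flags word. */
	if (ioctl(fd, EXFAT_IOC_SHUTDOWN, &flags) < 0) {
		perror("EXFAT_IOC_SHUTDOWN");
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}

After a successful shutdown, subsequent writes, fsync, and namespace operations on the volume should fail with EIO until the filesystem is unmounted.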
Signed-off-by: Dongliang Cui Signed-off-by: Zhiguo Niu Signed-off-by: Namjae Jeon --- fs/exfat/exfat_fs.h | 12 ++++++++++++ fs/exfat/file.c | 21 +++++++++++++++++++++ fs/exfat/inode.c | 9 +++++++++ fs/exfat/namei.c | 15 +++++++++++++++ fs/exfat/super.c | 39 +++++++++++++++++++++++++++++++++++++++ include/uapi/linux/exfat.h | 25 +++++++++++++++++++++++++ 6 files changed, 121 insertions(+) create mode 100644 include/uapi/linux/exfat.h (limited to 'include/uapi') diff --git a/fs/exfat/exfat_fs.h b/fs/exfat/exfat_fs.h index 5be214288a05..3cdc1de362a9 100644 --- a/fs/exfat/exfat_fs.h +++ b/fs/exfat/exfat_fs.h @@ -10,6 +10,7 @@ #include #include #include +#include #define EXFAT_ROOT_INO 1 @@ -148,6 +149,9 @@ enum { #define DIR_CACHE_SIZE \ (DIV_ROUND_UP(EXFAT_DEN_TO_B(ES_MAX_ENTRY_NUM), SECTOR_SIZE) + 1) +/* Superblock flags */ +#define EXFAT_FLAGS_SHUTDOWN 1 + struct exfat_dentry_namebuf { char *lfn; int lfnbuf_len; /* usually MAX_UNINAME_BUF_SIZE */ @@ -267,6 +271,8 @@ struct exfat_sb_info { unsigned int clu_srch_ptr; /* cluster search pointer */ unsigned int used_clusters; /* number of used clusters */ + unsigned long s_exfat_flags; /* Exfat superblock flags */ + struct mutex s_lock; /* superblock lock */ struct mutex bitmap_lock; /* bitmap lock */ struct exfat_mount_options options; @@ -331,6 +337,11 @@ static inline struct exfat_inode_info *EXFAT_I(struct inode *inode) return container_of(inode, struct exfat_inode_info, vfs_inode); } +static inline int exfat_forced_shutdown(struct super_block *sb) +{ + return test_bit(EXFAT_FLAGS_SHUTDOWN, &EXFAT_SB(sb)->s_exfat_flags); +} + /* * If ->i_mode can't hold 0222 (i.e. ATTR_RO), we use ->i_attrs to * save ATTR_RO instead of ->i_mode. @@ -459,6 +470,7 @@ int exfat_file_fsync(struct file *file, loff_t start, loff_t end, int datasync); long exfat_ioctl(struct file *filp, unsigned int cmd, unsigned long arg); long exfat_compat_ioctl(struct file *filp, unsigned int cmd, unsigned long arg); +int exfat_force_shutdown(struct super_block *sb, u32 flags); /* namei.c */ extern const struct dentry_operations exfat_dentry_ops; diff --git a/fs/exfat/file.c b/fs/exfat/file.c index 860c1f3b2cd7..a945ede3bf97 100644 --- a/fs/exfat/file.c +++ b/fs/exfat/file.c @@ -287,6 +287,9 @@ int exfat_setattr(struct mnt_idmap *idmap, struct dentry *dentry, unsigned int ia_valid; int error; + if (unlikely(exfat_forced_shutdown(inode->i_sb))) + return -EIO; + if ((attr->ia_valid & ATTR_SIZE) && attr->ia_size > i_size_read(inode)) { error = exfat_cont_expand(inode, attr->ia_size); @@ -470,6 +473,19 @@ static int exfat_ioctl_fitrim(struct inode *inode, unsigned long arg) return 0; } +static int exfat_ioctl_shutdown(struct super_block *sb, unsigned long arg) +{ + u32 flags; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + if (get_user(flags, (__u32 __user *)arg)) + return -EFAULT; + + return exfat_force_shutdown(sb, flags); +} + long exfat_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { struct inode *inode = file_inode(filp); @@ -480,6 +496,8 @@ long exfat_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) return exfat_ioctl_get_attributes(inode, user_attr); case FAT_IOCTL_SET_ATTRIBUTES: return exfat_ioctl_set_attributes(filp, user_attr); + case EXFAT_IOC_SHUTDOWN: + return exfat_ioctl_shutdown(inode->i_sb, arg); case FITRIM: return exfat_ioctl_fitrim(inode, arg); default: @@ -500,6 +518,9 @@ int exfat_file_fsync(struct file *filp, loff_t start, loff_t end, int datasync) struct inode *inode = filp->f_mapping->host; int err; + if 
(unlikely(exfat_forced_shutdown(inode->i_sb))) + return -EIO; + err = __generic_file_fsync(filp, start, end, datasync); if (err) return err; diff --git a/fs/exfat/inode.c b/fs/exfat/inode.c index e7a74a948be1..d724de8f57bf 100644 --- a/fs/exfat/inode.c +++ b/fs/exfat/inode.c @@ -102,6 +102,9 @@ int exfat_write_inode(struct inode *inode, struct writeback_control *wbc) { int ret; + if (unlikely(exfat_forced_shutdown(inode->i_sb))) + return -EIO; + mutex_lock(&EXFAT_SB(inode->i_sb)->s_lock); ret = __exfat_write_inode(inode, wbc->sync_mode == WB_SYNC_ALL); mutex_unlock(&EXFAT_SB(inode->i_sb)->s_lock); @@ -402,6 +405,9 @@ static void exfat_readahead(struct readahead_control *rac) static int exfat_writepages(struct address_space *mapping, struct writeback_control *wbc) { + if (unlikely(exfat_forced_shutdown(mapping->host->i_sb))) + return -EIO; + return mpage_writepages(mapping, wbc, exfat_get_block); } @@ -422,6 +428,9 @@ static int exfat_write_begin(struct file *file, struct address_space *mapping, { int ret; + if (unlikely(exfat_forced_shutdown(mapping->host->i_sb))) + return -EIO; + ret = block_write_begin(mapping, pos, len, foliop, exfat_get_block); if (ret < 0) diff --git a/fs/exfat/namei.c b/fs/exfat/namei.c index 3e6c789a72c4..2c4c44229352 100644 --- a/fs/exfat/namei.c +++ b/fs/exfat/namei.c @@ -547,6 +547,9 @@ static int exfat_create(struct mnt_idmap *idmap, struct inode *dir, int err; loff_t size = i_size_read(dir); + if (unlikely(exfat_forced_shutdown(sb))) + return -EIO; + mutex_lock(&EXFAT_SB(sb)->s_lock); exfat_set_volume_dirty(sb); err = exfat_add_entry(dir, dentry->d_name.name, &cdir, TYPE_FILE, @@ -770,6 +773,9 @@ static int exfat_unlink(struct inode *dir, struct dentry *dentry) struct exfat_entry_set_cache es; int entry, err = 0; + if (unlikely(exfat_forced_shutdown(sb))) + return -EIO; + mutex_lock(&EXFAT_SB(sb)->s_lock); exfat_chain_dup(&cdir, &ei->dir); entry = ei->entry; @@ -823,6 +829,9 @@ static int exfat_mkdir(struct mnt_idmap *idmap, struct inode *dir, int err; loff_t size = i_size_read(dir); + if (unlikely(exfat_forced_shutdown(sb))) + return -EIO; + mutex_lock(&EXFAT_SB(sb)->s_lock); exfat_set_volume_dirty(sb); err = exfat_add_entry(dir, dentry->d_name.name, &cdir, TYPE_DIR, @@ -913,6 +922,9 @@ static int exfat_rmdir(struct inode *dir, struct dentry *dentry) struct exfat_entry_set_cache es; int entry, err; + if (unlikely(exfat_forced_shutdown(sb))) + return -EIO; + mutex_lock(&EXFAT_SB(inode->i_sb)->s_lock); exfat_chain_dup(&cdir, &ei->dir); @@ -980,6 +992,9 @@ static int exfat_rename_file(struct inode *inode, struct exfat_chain *p_dir, struct exfat_entry_set_cache old_es, new_es; int sync = IS_DIRSYNC(inode); + if (unlikely(exfat_forced_shutdown(sb))) + return -EIO; + num_new_entries = exfat_calc_num_entries(p_uniname); if (num_new_entries < 0) return num_new_entries; diff --git a/fs/exfat/super.c b/fs/exfat/super.c index 904583f87433..bd57844414aa 100644 --- a/fs/exfat/super.c +++ b/fs/exfat/super.c @@ -46,6 +46,9 @@ static int exfat_sync_fs(struct super_block *sb, int wait) struct exfat_sb_info *sbi = EXFAT_SB(sb); int err = 0; + if (unlikely(exfat_forced_shutdown(sb))) + return 0; + if (!wait) return 0; @@ -167,6 +170,41 @@ static int exfat_show_options(struct seq_file *m, struct dentry *root) return 0; } +int exfat_force_shutdown(struct super_block *sb, u32 flags) +{ + int ret; + struct exfat_sb_info *sbi = sb->s_fs_info; + struct exfat_mount_options *opts = &sbi->options; + + if (exfat_forced_shutdown(sb)) + return 0; + + switch (flags) { + case 
EXFAT_GOING_DOWN_DEFAULT: + case EXFAT_GOING_DOWN_FULLSYNC: + ret = bdev_freeze(sb->s_bdev); + if (ret) + return ret; + bdev_thaw(sb->s_bdev); + set_bit(EXFAT_FLAGS_SHUTDOWN, &sbi->s_exfat_flags); + break; + case EXFAT_GOING_DOWN_NOSYNC: + set_bit(EXFAT_FLAGS_SHUTDOWN, &sbi->s_exfat_flags); + break; + default: + return -EINVAL; + } + + if (opts->discard) + opts->discard = 0; + return 0; +} + +static void exfat_shutdown(struct super_block *sb) +{ + exfat_force_shutdown(sb, EXFAT_GOING_DOWN_NOSYNC); +} + static struct inode *exfat_alloc_inode(struct super_block *sb) { struct exfat_inode_info *ei; @@ -193,6 +231,7 @@ static const struct super_operations exfat_sops = { .sync_fs = exfat_sync_fs, .statfs = exfat_statfs, .show_options = exfat_show_options, + .shutdown = exfat_shutdown, }; enum { diff --git a/include/uapi/linux/exfat.h b/include/uapi/linux/exfat.h new file mode 100644 index 000000000000..46d95b16fc4b --- /dev/null +++ b/include/uapi/linux/exfat.h @@ -0,0 +1,25 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +/* + * Copyright (C) 2024 Unisoc Technologies Co., Ltd. + */ + +#ifndef _UAPI_LINUX_EXFAT_H +#define _UAPI_LINUX_EXFAT_H +#include <linux/ioctl.h> +#include <linux/types.h> + +/* + * exfat-specific ioctl commands + */ + +#define EXFAT_IOC_SHUTDOWN _IOR('X', 125, __u32) + +/* + * Flags used by EXFAT_IOC_SHUTDOWN + */ + +#define EXFAT_GOING_DOWN_DEFAULT 0x0 /* default with full sync */ +#define EXFAT_GOING_DOWN_FULLSYNC 0x1 /* going down with full sync */ +#define EXFAT_GOING_DOWN_NOSYNC 0x2 /* going down */ + +#endif /* _UAPI_LINUX_EXFAT_H */ -- cgit v1.2.3 From 2fae6bb7be320270801b3c3b040189bd7daa8056 Mon Sep 17 00:00:00 2001 From: Jiqian Chen Date: Tue, 24 Sep 2024 14:14:37 +0800 Subject: xen/privcmd: Add new syscall to get gsi from dev On PVH dom0, when passing a device through to a domU, QEMU and the xl tools want to use the GSI number to set up the pirq mapping; see QEMU's xen_pt_realize->xc_physdev_map_pirq and xl's pci_add_dm_done->xc_physdev_map_pirq. But in the current code, the GSI number is read from the file /sys/bus/pci/devices/<sbdf>/irq, which is wrong: an IRQ is not equal to a GSI (they live in different number spaces), so the pirq mapping fails. Moreover, current Linux provides no way for userspace to get the GSI. To address this, record the GSI of pcistub devices when initializing pcistub, and add a new ioctl to privcmd so that userspace can query the GSI when needed.
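As a rough illustration (a hedged sketch, not taken from this patch): the following userspace program queries the GSI of an assumed passed-through device 0000:03:00.0 through the new privcmd ioctl. The struct and ioctl number mirror the include/uapi/xen/privcmd.h additions below, and are repeated locally in case an updated xen/privcmd.h header is not installed.

/*
 * Sketch: query the GSI of PCI device 0000:03:00.0 (an assumed address)
 * via IOCTL_PRIVCMD_PCIDEV_GET_GSI. Definitions mirror the uapi changes
 * in this patch.
 */
#include <fcntl.h>
#include <linux/ioctl.h>
#include <linux/types.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct privcmd_pcidev_get_gsi {
	__u32 sbdf;
	__u32 gsi;
};

#define IOCTL_PRIVCMD_PCIDEV_GET_GSI \
	_IOC(_IOC_NONE, 'P', 10, sizeof(struct privcmd_pcidev_get_gsi))

int main(void)
{
	/*
	 * Encoding matches pcistub_get_gsi_from_sbdf() above:
	 * sbdf = (segment << 16) | (bus << 8) | (slot << 3) | func
	 */
	struct privcmd_pcidev_get_gsi arg = {
		.sbdf = (0x0000 << 16) | (0x03 << 8) | (0x00 << 3) | 0x0,
	};
	int fd = open("/dev/xen/privcmd", O_RDWR);

	if (fd < 0) {
		perror("open /dev/xen/privcmd");
		return 1;
	}
	if (ioctl(fd, IOCTL_PRIVCMD_PCIDEV_GET_GSI, &arg) < 0) {
		perror("IOCTL_PRIVCMD_PCIDEV_GET_GSI");
		close(fd);
		return 1;
	}
	printf("gsi = %u\n", arg.gsi);
	close(fd);
	return 0;
}

The device must already be bound to pciback (xen-pciback records the GSI at pcistub init time); otherwise the ioctl fails with ENODEV.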
Signed-off-by: Jiqian Chen Signed-off-by: Huang Rui Signed-off-by: Jiqian Chen Reviewed-by: Stefano Stabellini Message-ID: <20240924061437.2636766-4-Jiqian.Chen@amd.com> Signed-off-by: Juergen Gross --- drivers/xen/Kconfig | 1 + drivers/xen/privcmd.c | 32 ++++++++++++++++++++++++++++++++ drivers/xen/xen-pciback/pci_stub.c | 38 +++++++++++++++++++++++++++++++++++--- include/uapi/xen/privcmd.h | 7 +++++++ include/xen/acpi.h | 9 +++++++++ 5 files changed, 84 insertions(+), 3 deletions(-) (limited to 'include/uapi') diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig index d5989871dd5d..fd83d51df2f4 100644 --- a/drivers/xen/Kconfig +++ b/drivers/xen/Kconfig @@ -261,6 +261,7 @@ config XEN_SCSI_BACKEND config XEN_PRIVCMD tristate "Xen hypercall passthrough driver" depends on XEN + imply CONFIG_XEN_PCIDEV_BACKEND default m help The hypercall passthrough driver allows privileged user programs to diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c index 9563650dfbaf..588f99a2b8df 100644 --- a/drivers/xen/privcmd.c +++ b/drivers/xen/privcmd.c @@ -46,6 +46,9 @@ #include #include #include +#ifdef CONFIG_XEN_ACPI +#include +#endif #include "privcmd.h" @@ -844,6 +847,31 @@ out: return rc; } +static long privcmd_ioctl_pcidev_get_gsi(struct file *file, void __user *udata) +{ +#if defined(CONFIG_XEN_ACPI) + int rc = -EINVAL; + struct privcmd_pcidev_get_gsi kdata; + + if (copy_from_user(&kdata, udata, sizeof(kdata))) + return -EFAULT; + + if (IS_REACHABLE(CONFIG_XEN_PCIDEV_BACKEND)) + rc = pcistub_get_gsi_from_sbdf(kdata.sbdf); + + if (rc < 0) + return rc; + + kdata.gsi = rc; + if (copy_to_user(udata, &kdata, sizeof(kdata))) + return -EFAULT; + + return 0; +#else + return -EINVAL; +#endif +} + #ifdef CONFIG_XEN_PRIVCMD_EVENTFD /* Irqfd support */ static struct workqueue_struct *irqfd_cleanup_wq; @@ -1543,6 +1571,10 @@ static long privcmd_ioctl(struct file *file, ret = privcmd_ioctl_ioeventfd(file, udata); break; + case IOCTL_PRIVCMD_PCIDEV_GET_GSI: + ret = privcmd_ioctl_pcidev_get_gsi(file, udata); + break; + default: break; } diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c index 8ce27333f54b..d003402ce66b 100644 --- a/drivers/xen/xen-pciback/pci_stub.c +++ b/drivers/xen/xen-pciback/pci_stub.c @@ -56,6 +56,9 @@ struct pcistub_device { struct pci_dev *dev; struct xen_pcibk_device *pdev;/* non-NULL if struct pci_dev is in use */ +#ifdef CONFIG_XEN_ACPI + int gsi; +#endif }; /* Access to pcistub_devices & seized_devices lists and the initialize_devices @@ -88,6 +91,9 @@ static struct pcistub_device *pcistub_device_alloc(struct pci_dev *dev) kref_init(&psdev->kref); spin_lock_init(&psdev->lock); +#ifdef CONFIG_XEN_ACPI + psdev->gsi = -1; +#endif return psdev; } @@ -220,6 +226,25 @@ static struct pci_dev *pcistub_device_get_pci_dev(struct xen_pcibk_device *pdev, return pci_dev; } +#ifdef CONFIG_XEN_ACPI +int pcistub_get_gsi_from_sbdf(unsigned int sbdf) +{ + struct pcistub_device *psdev; + int domain = (sbdf >> 16) & 0xffff; + int bus = PCI_BUS_NUM(sbdf); + int slot = PCI_SLOT(sbdf); + int func = PCI_FUNC(sbdf); + + psdev = pcistub_device_find(domain, bus, slot, func); + + if (!psdev) + return -ENODEV; + + return psdev->gsi; +} +EXPORT_SYMBOL_GPL(pcistub_get_gsi_from_sbdf); +#endif + struct pci_dev *pcistub_get_pci_dev_by_slot(struct xen_pcibk_device *pdev, int domain, int bus, int slot, int func) @@ -367,14 +392,20 @@ static int pcistub_match(struct pci_dev *dev) return found; } -static int pcistub_init_device(struct pci_dev *dev) +static int 
pcistub_init_device(struct pcistub_device *psdev) { struct xen_pcibk_dev_data *dev_data; + struct pci_dev *dev; #ifdef CONFIG_XEN_ACPI int gsi, trigger, polarity; #endif int err = 0; + if (!psdev) + return -EINVAL; + + dev = psdev->dev; + dev_dbg(&dev->dev, "initializing...\n"); /* The PCI backend is not intended to be a module (or to work with @@ -452,6 +483,7 @@ static int pcistub_init_device(struct pci_dev *dev) err = xen_pvh_setup_gsi(gsi, trigger, polarity); if (err) goto config_release; + psdev->gsi = gsi; } #endif @@ -494,7 +526,7 @@ static int __init pcistub_init_devices_late(void) spin_unlock_irqrestore(&pcistub_devices_lock, flags); - err = pcistub_init_device(psdev->dev); + err = pcistub_init_device(psdev); if (err) { dev_err(&psdev->dev->dev, "error %d initializing device\n", err); @@ -564,7 +596,7 @@ static int pcistub_seize(struct pci_dev *dev, spin_unlock_irqrestore(&pcistub_devices_lock, flags); /* don't want irqs disabled when calling pcistub_init_device */ - err = pcistub_init_device(psdev->dev); + err = pcistub_init_device(psdev); spin_lock_irqsave(&pcistub_devices_lock, flags); diff --git a/include/uapi/xen/privcmd.h b/include/uapi/xen/privcmd.h index 8b8c5d1420fe..8e2c8fd44764 100644 --- a/include/uapi/xen/privcmd.h +++ b/include/uapi/xen/privcmd.h @@ -126,6 +126,11 @@ struct privcmd_ioeventfd { __u8 pad[2]; }; +struct privcmd_pcidev_get_gsi { + __u32 sbdf; + __u32 gsi; +}; + /* * @cmd: IOCTL_PRIVCMD_HYPERCALL * @arg: &privcmd_hypercall_t @@ -157,5 +162,7 @@ struct privcmd_ioeventfd { _IOW('P', 8, struct privcmd_irqfd) #define IOCTL_PRIVCMD_IOEVENTFD \ _IOW('P', 9, struct privcmd_ioeventfd) +#define IOCTL_PRIVCMD_PCIDEV_GET_GSI \ + _IOC(_IOC_NONE, 'P', 10, sizeof(struct privcmd_pcidev_get_gsi)) #endif /* __LINUX_PUBLIC_PRIVCMD_H__ */ diff --git a/include/xen/acpi.h b/include/xen/acpi.h index 3bcfe82d9078..daa96a22d257 100644 --- a/include/xen/acpi.h +++ b/include/xen/acpi.h @@ -91,4 +91,13 @@ static inline int xen_acpi_get_gsi_info(struct pci_dev *dev, } #endif +#ifdef CONFIG_XEN_PCI_STUB +int pcistub_get_gsi_from_sbdf(unsigned int sbdf); +#else +static inline int pcistub_get_gsi_from_sbdf(unsigned int sbdf) +{ + return -1; +} +#endif + #endif /* _XEN_ACPI_H */ -- cgit v1.2.3
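Tying the two Landlock scoping patches above together, here is a hedged userspace sketch of the new ABI: a process creating and enforcing a ruleset whose only restrictions are the two scopes added by this series. It assumes a <linux/landlock.h> header from a kernel with these patches (ABI 6) and uses raw syscall numbers, since libc wrappers for the Landlock syscalls may be absent; error handling is minimal.

/*
 * Sketch: self-sandbox with the scoping bits added above. Afterwards,
 * connecting to an abstract UNIX socket created outside the domain or
 * signalling a process outside the domain fails with EPERM.
 */
#define _GNU_SOURCE
#include <linux/landlock.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	const struct landlock_ruleset_attr attr = {
		.scoped = LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET |
			  LANDLOCK_SCOPE_SIGNAL,
	};
	int ruleset_fd;

	ruleset_fd = syscall(__NR_landlock_create_ruleset, &attr,
			     sizeof(attr), 0);
	if (ruleset_fd < 0) {
		perror("landlock_create_ruleset");
		return 1;
	}
	/* Mandatory before enforcing any Landlock ruleset. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
		perror("prctl");
		return 1;
	}
	if (syscall(__NR_landlock_restrict_self, ruleset_fd, 0)) {
		perror("landlock_restrict_self");
		return 1;
	}
	close(ruleset_fd);

	/* E.g. kill(getppid(), SIGTERM) would now fail with EPERM. */
	return 0;
}

Handled filesystem and network accesses are deliberately left empty here; scoping composes with them independently, which is why a scoped-only ruleset is accepted by sys_landlock_create_ruleset().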