<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/kernel/pid_namespace.c, branch v6.16.1</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.16.1</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.16.1'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2025-03-06T09:18:36Z</updated>
<entry>
<title>pid: Do not set pid_max in new pid namespaces</title>
<updated>2025-03-06T09:18:36Z</updated>
<author>
<name>Michal Koutný</name>
<email>mkoutny@suse.com</email>
</author>
<published>2025-03-05T14:58:49Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=d385c8bceb14665e935419334aa3d3fac2f10456'/>
<id>urn:sha1:d385c8bceb14665e935419334aa3d3fac2f10456</id>
<content type='text'>
It is already difficult for users to troubleshoot which of multiple pid
limits restricts their workload. The per-(hierarchical-)NS pid_max would
contribute to the confusion.
Also, the implementation copies the limit upon creation from
parent, this pattern showed cumbersome with some attributes in legacy
cgroup controllers -- it's subject to race condition between parent's
limit modification and children creation and once copied it must be
changed in the descendant.

Let's do what other places do (ucounts or cgroup limits) -- create new
pid namespaces without any limit at all. The global limit (actually any
ancestor's limit) is still effectively in place, we avoid the
set/unshare race and bumps of global (ancestral) limit have the desired
effect on pid namespace that do not care.

Link: https://lore.kernel.org/r/20240408145819.8787-1-mkoutny@suse.com/
Link: https://lore.kernel.org/r/20250221170249.890014-1-mkoutny@suse.com/
Fixes: 7863dcc72d0f4 ("pid: allow pid_max to be set per pid namespace")
Signed-off-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Link: https://lore.kernel.org/r/20250305145849.55491-1-mkoutny@suse.com
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>treewide: const qualify ctl_tables where applicable</title>
<updated>2025-01-28T12:48:37Z</updated>
<author>
<name>Joel Granados</name>
<email>joel.granados@kernel.org</email>
</author>
<published>2025-01-28T12:48:37Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=1751f872cc97f992ed5c4c72c55588db1f0021e1'/>
<id>urn:sha1:1751f872cc97f992ed5c4c72c55588db1f0021e1</id>
<content type='text'>
Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/inifiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.

Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit 78eb4ea25cd5 ("sysctl: treewide:
constify the ctl_table argument of proc_handlers") constified all the
proc_handlers.

Created this by running an spatch followed by a sed command:
Spatch:
    virtual patch

    @
    depends on !(file in "net")
    disable optional_qualifier
    @

    identifier table_name != {
      watchdog_hardlockup_sysctl,
      iwcm_ctl_table,
      ucma_ctl_table,
      memory_allocation_profiling_sysctls,
      loadpin_sysctl_table
    };
    @@

    + const
    struct ctl_table table_name [] = { ... };

sed:
    sed --in-place \
      -e "s/struct ctl_table .table = &amp;uts_kern/const struct ctl_table *table = \&amp;uts_kern/" \
      kernel/utsname_sysctl.c

Reviewed-by: Song Liu &lt;song@kernel.org&gt;
Acked-by: Steven Rostedt (Google) &lt;rostedt@goodmis.org&gt; # for kernel/trace/
Reviewed-by: Martin K. Petersen &lt;martin.petersen@oracle.com&gt; # SCSI
Reviewed-by: Darrick J. Wong &lt;djwong@kernel.org&gt; # xfs
Acked-by: Jani Nikula &lt;jani.nikula@intel.com&gt;
Acked-by: Corey Minyard &lt;cminyard@mvista.com&gt;
Acked-by: Wei Liu &lt;wei.liu@kernel.org&gt;
Acked-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Bill O'Donnell &lt;bodonnel@redhat.com&gt;
Acked-by: Baoquan He &lt;bhe@redhat.com&gt;
Acked-by: Ashutosh Dixit &lt;ashutosh.dixit@intel.com&gt;
Acked-by: Anna Schumaker &lt;anna.schumaker@oracle.com&gt;
Signed-off-by: Joel Granados &lt;joel.granados@kernel.org&gt;
</content>
</entry>
<entry>
<title>pid: allow pid_max to be set per pid namespace</title>
<updated>2024-12-02T10:25:25Z</updated>
<author>
<name>Christian Brauner</name>
<email>christian.brauner@ubuntu.com</email>
</author>
<published>2024-11-22T13:24:58Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=7863dcc72d0f4b13a641065670426435448b3d80'/>
<id>urn:sha1:7863dcc72d0f4b13a641065670426435448b3d80</id>
<content type='text'>
The pid_max sysctl is a global value. For a long time the default value
has been 65535 and during the pidfd dicussions Linus proposed to bump
pid_max by default (cf. [1]). Based on this discussion systemd started
bumping pid_max to 2^22. So all new systems now run with a very high
pid_max limit with some distros having also backported that change.
The decision to bump pid_max is obviously correct. It just doesn't make
a lot of sense nowadays to enforce such a low pid number. There's
sufficient tooling to make selecting specific processes without typing
really large pid numbers available.

In any case, there are workloads that have expections about how large
pid numbers they accept. Either for historical reasons or architectural
reasons. One concreate example is the 32-bit version of Android's bionic
libc which requires pid numbers less than 65536. There are workloads
where it is run in a 32-bit container on a 64-bit kernel. If the host
has a pid_max value greater than 65535 the libc will abort thread
creation because of size assumptions of pthread_mutex_t.

That's a fairly specific use-case however, in general specific workloads
that are moved into containers running on a host with a new kernel and a
new systemd can run into issues with large pid_max values. Obviously
making assumptions about the size of the allocated pid is suboptimal but
we have userspace that does it.

Of course, giving containers the ability to restrict the number of
processes in their respective pid namespace indepent of the global limit
through pid_max is something desirable in itself and comes in handy in
general.

Independent of motivating use-cases the existence of pid namespaces
makes this also a good semantical extension and there have been prior
proposals pushing in a similar direction.
The trick here is to minimize the risk of regressions which I think is
doable. The fact that pid namespaces are hierarchical will help us here.

What we mostly care about is that when the host sets a low pid_max
limit, say (crazy number) 100 that no descendant pid namespace can
allocate a higher pid number in its namespace. Since pid allocation is
hierarchial this can be ensured by checking each pid allocation against
the pid namespace's pid_max limit. This means if the allocation in the
descendant pid namespace succeeds, the ancestor pid namespace can reject
it. If the ancestor pid namespace has a higher limit than the descendant
pid namespace the descendant pid namespace will reject the pid
allocation. The ancestor pid namespace will obviously not care about
this.
All in all this means pid_max continues to enforce a system wide limit
on the number of processes but allows pid namespaces sufficient leeway
in handling workloads with assumptions about pid values and allows
containers to restrict the number of processes in a pid namespace
through the pid_max interface.

[1]: https://lore.kernel.org/linux-api/CAHk-=wiZ40LVjnXSi9iHLE_-ZBsWFGCgdmNiYZUXn1-V5YBg2g@mail.gmail.com
- rebased from 5.14-rc1
- a few fixes (missing ns_free_inum on error path, missing initialization, etc)
- permission check changes in pid_table_root_permissions
- unsigned int pid_max -&gt; int pid_max (keep pid_max type as it was)
- add READ_ONCE in alloc_pid() as suggested by Christian
- rebased from 6.7 and take into account:
 * sysctl: treewide: drop unused argument ctl_table_root::set_ownership(table)
 * sysctl: treewide: constify ctl_table_header::ctl_table_arg
 * pidfd: add pidfs
 * tracing: Move saved_cmdline code into trace_sched_switch.c

Signed-off-by: Alexander Mikhalitsyn &lt;aleksandr.mikhalitsyn@canonical.com&gt;
Link: https://lore.kernel.org/r/20241122132459.135120-2-aleksandr.mikhalitsyn@canonical.com
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>sysctl: treewide: constify the ctl_table argument of proc_handlers</title>
<updated>2024-07-24T18:59:29Z</updated>
<author>
<name>Joel Granados</name>
<email>j.granados@samsung.com</email>
</author>
<published>2024-07-24T18:59:29Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=78eb4ea25cd5fdbdae7eb9fdf87b99195ff67508'/>
<id>urn:sha1:78eb4ea25cd5fdbdae7eb9fdf87b99195ff67508</id>
<content type='text'>
const qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ctl_table
structs into .rodata data which will ensure that proc_handler function
pointers cannot be modified.

This patch has been generated by the following coccinelle script:

```
  virtual patch

  @r1@
  identifier ctl, write, buffer, lenp, ppos;
  identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int write, void *buffer, size_t *lenp, loff_t *ppos);

  @r2@
  identifier func, ctl, write, buffer, lenp, ppos;
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int write, void *buffer, size_t *lenp, loff_t *ppos)
  { ... }

  @r3@
  identifier func;
  @@

  int func(
  - struct ctl_table *
  + const struct ctl_table *
    ,int , void *, size_t *, loff_t *);

  @r4@
  identifier func, ctl;
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int , void *, size_t *, loff_t *);

  @r5@
  identifier func, write, buffer, lenp, ppos;
  @@

  int func(
  - struct ctl_table *
  + const struct ctl_table *
    ,int write, void *buffer, size_t *lenp, loff_t *ppos);

```

* Code formatting was adjusted in xfs_sysctl.c to comply with code
  conventions. The xfs_stats_clear_proc_handler,
  xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where
  adjusted.

* The ctl_table argument in proc_watchdog_common was const qualified.
  This is called from a proc_handler itself and is calling back into
  another proc_handler, making it necessary to change it as part of the
  proc_handler migration.

Co-developed-by: Thomas Weißschuh &lt;linux@weissschuh.net&gt;
Signed-off-by: Thomas Weißschuh &lt;linux@weissschuh.net&gt;
Co-developed-by: Joel Granados &lt;j.granados@samsung.com&gt;
Signed-off-by: Joel Granados &lt;j.granados@samsung.com&gt;
</content>
</entry>
<entry>
<title>Merge tag 'rcu.2024.07.12a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu</title>
<updated>2024-07-15T22:25:27Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2024-07-15T22:25:27Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=9855e873285f1395b9a04613c56101f04a5ec9fb'/>
<id>urn:sha1:9855e873285f1395b9a04613c56101f04a5ec9fb</id>
<content type='text'>
Pull RCU updates from Paul McKenney:

 - Update Tasks RCU and Tasks Rude RCU description in Requirements.rst
   and clarify rcu_assign_pointer() and rcu_dereference() ordering
   properties

 - Add lockdep assertions for RCU readers, limit inline wakeups for
   callback-bypass synchronize_rcu(), add an
   rcutree.nohz_full_patience_delay to reduce nohz_full OS jitter, add
   Uladzislau Rezki as RCU maintainer, and fix a subtle
   callback-migration memory-ordering issue

 - Remove a number of redundant memory barriers

 - Remove unnecessary bypass-list lock-contention mitigation, use
   parking API instead of open-coded ad-hoc equivalent, and upgrade
   obsolete comments

 - Revert avoidance of a deadlock that can no longer occur and properly
   synchronize Tasks Trace RCU checking of runqueues

 - Add tests for handling of double-call_rcu() bug, add missing
   MODULE_DESCRIPTION, and add a script that histograms the number of
   calls to RCU updaters

 - Fill out SRCU polled-grace-period API

* tag 'rcu.2024.07.12a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (29 commits)
  rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU invocation
  rcu: Eliminate lockless accesses to rcu_sync-&gt;gp_count
  MAINTAINERS: Add Uladzislau Rezki as RCU maintainer
  rcu: Add rcutree.nohz_full_patience_delay to reduce nohz_full OS jitter
  rcu/exp: Remove redundant full memory barrier at the end of GP
  rcu: Remove full memory barrier on RCU stall printout
  rcu: Remove full memory barrier on boot time eqs sanity check
  rcu/exp: Remove superfluous full memory barrier upon first EQS snapshot
  rcu: Remove superfluous full memory barrier upon first EQS snapshot
  rcu: Remove full ordering on second EQS snapshot
  srcu: Fill out polled grace-period APIs
  srcu: Update cleanup_srcu_struct() comment
  srcu: Add NUM_ACTIVE_SRCU_POLL_OLDSTATE
  srcu: Disable interrupts directly in srcu_gp_end()
  rcu: Disable interrupts directly in rcu_gp_init()
  rcu/tree: Reduce wake up for synchronize_rcu() common case
  rcu/tasks: Fix stale task snaphot for Tasks Trace
  tools/rcu: Add rcu-updaters.sh script
  rcutorture: Add missing MODULE_DESCRIPTION() macros
  rcutorture: Fix rcu_torture_fwd_cb_cr() data race
  ...
</content>
</entry>
<entry>
<title>zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING</title>
<updated>2024-06-15T17:43:06Z</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2024-06-08T12:06:16Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=7fea700e04bd3f424c2d836e98425782f97b494e'/>
<id>urn:sha1:7fea700e04bd3f424c2d836e98425782f97b494e</id>
<content type='text'>
kernel_wait4() doesn't sleep and returns -EINTR if there is no
eligible child and signal_pending() is true.

That is why zap_pid_ns_processes() clears TIF_SIGPENDING but this is not
enough, it should also clear TIF_NOTIFY_SIGNAL to make signal_pending()
return false and avoid a busy-wait loop.

Link: https://lkml.kernel.org/r/20240608120616.GB7947@redhat.com
Fixes: 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Reported-by: Rachel Menge &lt;rachelmenge@linux.microsoft.com&gt;
Closes: https://lore.kernel.org/all/1386cd49-36d0-4a5c-85e9-bc42056a5a38@linux.microsoft.com/
Reviewed-by: Boqun Feng &lt;boqun.feng@gmail.com&gt;
Tested-by: Wei Fu &lt;fuweid89@gmail.com&gt;
Reviewed-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Allen Pais &lt;apais@linux.microsoft.com&gt;
Cc: Christian Brauner &lt;brauner@kernel.org&gt;
Cc: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Cc: Joel Fernandes (Google) &lt;joel@joelfernandes.org&gt;
Cc: Joel Granados &lt;j.granados@samsung.com&gt;
Cc: Josh Triplett &lt;josh@joshtriplett.org&gt;
Cc: Lai Jiangshan &lt;jiangshanlai@gmail.com&gt;
Cc: Mateusz Guzik &lt;mjguzik@gmail.com&gt;
Cc: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Cc: Mike Christie &lt;michael.christie@oracle.com&gt;
Cc: Neeraj Upadhyay &lt;neeraj.upadhyay@kernel.org&gt;
Cc: Paul E. McKenney &lt;paulmck@kernel.org&gt;
Cc: Steven Rostedt (Google) &lt;rostedt@goodmis.org&gt;
Cc: Zqiang &lt;qiang.zhang1211@gmail.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Revert "rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes()"</title>
<updated>2024-06-04T00:30:08Z</updated>
<author>
<name>Frederic Weisbecker</name>
<email>frederic@kernel.org</email>
</author>
<published>2024-04-25T14:24:04Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=9855c37edf0009cc276cecfee09f7e76e2380212'/>
<id>urn:sha1:9855c37edf0009cc276cecfee09f7e76e2380212</id>
<content type='text'>
This reverts commit 28319d6dc5e2ffefa452c2377dd0f71621b5bff0. The race
it fixed was subject to conditions that don't exist anymore since:

	1612160b9127 ("rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks")

This latter commit removes the use of SRCU that used to cover the
RCU-tasks blind spot on exit between the tasklist's removal and the
final preemption disabling. The task is now placed instead into a
temporary list inside which voluntary sleeps are accounted as RCU-tasks
quiescent states. This would disarm the deadlock initially reported
against PID namespace exit.

Signed-off-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Reviewed-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Paul E. McKenney &lt;paulmck@kernel.org&gt;
</content>
</entry>
<entry>
<title>kernel misc: Remove the now superfluous sentinel elements from ctl_table array</title>
<updated>2024-04-24T07:43:53Z</updated>
<author>
<name>Joel Granados</name>
<email>j.granados@samsung.com</email>
</author>
<published>2023-06-27T13:30:19Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=11a921909fea230cf7afcd6842a9452f3720b61b'/>
<id>urn:sha1:11a921909fea230cf7afcd6842a9452f3720b61b</id>
<content type='text'>
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

Remove the sentinel from ctl_table arrays. Reduce by one the values used
to compare the size of the adjusted arrays.

Signed-off-by: Joel Granados &lt;j.granados@samsung.com&gt;
</content>
</entry>
<entry>
<title>wait: Remove uapi header file from main header file</title>
<updated>2023-12-21T00:26:31Z</updated>
<author>
<name>Matthew Wilcox (Oracle)</name>
<email>willy@infradead.org</email>
</author>
<published>2023-12-11T18:14:41Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=6dfeff09d5ad331905c7066207053d286d58ac83'/>
<id>urn:sha1:6dfeff09d5ad331905c7066207053d286d58ac83</id>
<content type='text'>
There's really no overlap between uapi/linux/wait.h and linux/wait.h.
There are two files which rely on the uapi file being implcitly included,
so explicitly include it there and remove it from the main header file.

Signed-off-by: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Signed-off-by: Kent Overstreet &lt;kent.overstreet@linux.dev&gt;
Reviewed-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>pid: pid_ns_ctl_handler: remove useless comment</title>
<updated>2023-10-04T17:41:57Z</updated>
<author>
<name>Rong Tao</name>
<email>rongtao@cestc.cn</email>
</author>
<published>2023-09-11T14:55:09Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=2d57792a39e5473c5d98fc0138c4e9ab9c98a57b'/>
<id>urn:sha1:2d57792a39e5473c5d98fc0138c4e9ab9c98a57b</id>
<content type='text'>
commit 95846ecf9dac("pid: replace pid bitmap implementation with IDR API")
removes 'last_pid' element, and use the idr_get_cursor-idr_set_cursor pair
to set the value of idr, so useless comments should be removed.

Link: https://lkml.kernel.org/r/tencent_157A2A1CAF19A3F5885F0687426159A19708@qq.com
Signed-off-by: Rong Tao &lt;rongtao@cestc.cn&gt;
Cc: Aleksa Sarai &lt;cyphar@cyphar.com&gt;
Cc: Christian Brauner &lt;brauner@kernel.org&gt;
Cc: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Cc: Jeff Xu &lt;jeffxu@google.com&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Luis Chamberlain &lt;mcgrof@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
</feed>
