<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/Documentation/admin-guide, branch v5.4.2</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.4.2</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.4.2'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2019-11-29T09:09:46Z</updated>
<entry>
<title>x86/speculation: Fix incorrect MDS/TAA mitigation status</title>
<updated>2019-11-29T09:09:46Z</updated>
<author>
<name>Waiman Long</name>
<email>longman@redhat.com</email>
</author>
<published>2019-11-15T16:14:44Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=75cad94d032b9ef500bc12004a3c695a9e85c729'/>
<id>urn:sha1:75cad94d032b9ef500bc12004a3c695a9e85c729</id>
<content type='text'>
commit 64870ed1b12e235cfca3f6c6da75b542c973ff78 upstream.

For MDS vulnerable processors with TSX support, enabling either MDS or
TAA mitigations will enable the use of VERW to flush internal processor
buffers at the right code path. IOW, they are either both mitigated
or both not. However, if the command line options are inconsistent,
the vulnerabilites sysfs files may not report the mitigation status
correctly.

For example, with only the "mds=off" option:

  vulnerabilities/mds:Vulnerable; SMT vulnerable
  vulnerabilities/tsx_async_abort:Mitigation: Clear CPU buffers; SMT vulnerable

The mds vulnerabilities file has wrong status in this case. Similarly,
the taa vulnerability file will be wrong with mds mitigation on, but
taa off.

Change taa_select_mitigation() to sync up the two mitigation status
and have them turned off if both "mds=off" and "tsx_async_abort=off"
are present.

Update documentation to emphasize the fact that both "mds=off" and
"tsx_async_abort=off" have to be specified together for processors that
are affected by both TAA and MDS to be effective.

 [ bp: Massage and add kernel-parameters.txt change too. ]

Fixes: 1b42f017415b ("x86/speculation/taa: Add mitigation for TSX Async Abort")
Signed-off-by: Waiman Long &lt;longman@redhat.com&gt;
Signed-off-by: Borislav Petkov &lt;bp@suse.de&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Jiri Kosina &lt;jkosina@suse.cz&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Josh Poimboeuf &lt;jpoimboe@redhat.com&gt;
Cc: linux-doc@vger.kernel.org
Cc: Mark Gross &lt;mgross@linux.intel.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Cc: Pawan Gupta &lt;pawan.kumar.gupta@linux.intel.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Tim Chen &lt;tim.c.chen@linux.intel.com&gt;
Cc: Tony Luck &lt;tony.luck@intel.com&gt;
Cc: Tyler Hicks &lt;tyhicks@canonical.com&gt;
Cc: x86-ml &lt;x86@kernel.org&gt;
Link: https://lkml.kernel.org/r/20191115161445.30809-2-longman@redhat.com
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>Documentation: Add ITLB_MULTIHIT documentation</title>
<updated>2019-11-04T19:26:00Z</updated>
<author>
<name>Gomez Iglesias, Antonio</name>
<email>antonio.gomez.iglesias@intel.com</email>
</author>
<published>2019-11-04T19:26:00Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=7f00cc8d4a51074eb0ad4c3f16c15757b1ddfb7d'/>
<id>urn:sha1:7f00cc8d4a51074eb0ad4c3f16c15757b1ddfb7d</id>
<content type='text'>
Add the initial ITLB_MULTIHIT documentation.

[ tglx: Add it to the index so it gets actually built. ]

Signed-off-by: Antonio Gomez Iglesias &lt;antonio.gomez.iglesias@intel.com&gt;
Signed-off-by: Nelson D'Souza &lt;nelson.dsouza@linux.intel.com&gt;
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;

</content>
</entry>
<entry>
<title>kvm: x86: mmu: Recovery of shattered NX large pages</title>
<updated>2019-11-04T19:26:00Z</updated>
<author>
<name>Junaid Shahid</name>
<email>junaids@google.com</email>
</author>
<published>2019-11-04T19:26:00Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=1aa9b9572b10529c2e64e2b8f44025d86e124308'/>
<id>urn:sha1:1aa9b9572b10529c2e64e2b8f44025d86e124308</id>
<content type='text'>
The page table pages corresponding to broken down large pages are zapped in
FIFO order, so that the large page can potentially be recovered, if it is
not longer being used for execution.  This removes the performance penalty
for walking deeper EPT page tables.

By default, one large page will last about one hour once the guest
reaches a steady state.

Signed-off-by: Junaid Shahid &lt;junaids@google.com&gt;
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;

</content>
</entry>
<entry>
<title>kvm: mmu: ITLB_MULTIHIT mitigation</title>
<updated>2019-11-04T11:22:02Z</updated>
<author>
<name>Paolo Bonzini</name>
<email>pbonzini@redhat.com</email>
</author>
<published>2019-11-04T11:22:02Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=b8e8c8303ff28c61046a4d0f6ea99aea609a7dc0'/>
<id>urn:sha1:b8e8c8303ff28c61046a4d0f6ea99aea609a7dc0</id>
<content type='text'>
With some Intel processors, putting the same virtual address in the TLB
as both a 4 KiB and 2 MiB page can confuse the instruction fetch unit
and cause the processor to issue a machine check resulting in a CPU lockup.

Unfortunately when EPT page tables use huge pages, it is possible for a
malicious guest to cause this situation.

Add a knob to mark huge pages as non-executable. When the nx_huge_pages
parameter is enabled (and we are using EPT), all huge pages are marked as
NX. If the guest attempts to execute in one of those pages, the page is
broken down into 4K pages, which are then marked executable.

This is not an issue for shadow paging (except nested EPT), because then
the host is in control of TLB flushes and the problematic situation cannot
happen.  With nested EPT, again the nested guest can cause problems shadow
and direct EPT is treated in the same way.

[ tglx: Fixup default to auto and massage wording a bit ]

Originally-by: Junaid Shahid &lt;junaids@google.com&gt;
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;

</content>
</entry>
<entry>
<title>x86/speculation/taa: Add documentation for TSX Async Abort</title>
<updated>2019-10-28T07:37:00Z</updated>
<author>
<name>Pawan Gupta</name>
<email>pawan.kumar.gupta@linux.intel.com</email>
</author>
<published>2019-10-23T10:32:55Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a7a248c593e4fd7a67c50b5f5318fe42a0db335e'/>
<id>urn:sha1:a7a248c593e4fd7a67c50b5f5318fe42a0db335e</id>
<content type='text'>
Add the documenation for TSX Async Abort. Include the description of
the issue, how to check the mitigation state, control the mitigation,
guidance for system administrators.

 [ bp: Add proper SPDX tags, touch ups by Josh and me. ]

Co-developed-by: Antonio Gomez Iglesias &lt;antonio.gomez.iglesias@intel.com&gt;

Signed-off-by: Pawan Gupta &lt;pawan.kumar.gupta@linux.intel.com&gt;
Signed-off-by: Antonio Gomez Iglesias &lt;antonio.gomez.iglesias@intel.com&gt;
Signed-off-by: Borislav Petkov &lt;bp@suse.de&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Mark Gross &lt;mgross@linux.intel.com&gt;
Reviewed-by: Tony Luck &lt;tony.luck@intel.com&gt;
Reviewed-by: Josh Poimboeuf &lt;jpoimboe@redhat.com&gt;

</content>
</entry>
<entry>
<title>x86/tsx: Add "auto" option to the tsx= cmdline parameter</title>
<updated>2019-10-28T07:37:00Z</updated>
<author>
<name>Pawan Gupta</name>
<email>pawan.kumar.gupta@linux.intel.com</email>
</author>
<published>2019-10-23T10:28:57Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=7531a3596e3272d1f6841e0d601a614555dc6b65'/>
<id>urn:sha1:7531a3596e3272d1f6841e0d601a614555dc6b65</id>
<content type='text'>
Platforms which are not affected by X86_BUG_TAA may want the TSX feature
enabled. Add "auto" option to the TSX cmdline parameter. When tsx=auto
disable TSX when X86_BUG_TAA is present, otherwise enable TSX.

More details on X86_BUG_TAA can be found here:
https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html

 [ bp: Extend the arg buffer to accommodate "auto\0". ]

Signed-off-by: Pawan Gupta &lt;pawan.kumar.gupta@linux.intel.com&gt;
Signed-off-by: Borislav Petkov &lt;bp@suse.de&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Tony Luck &lt;tony.luck@intel.com&gt;
Reviewed-by: Josh Poimboeuf &lt;jpoimboe@redhat.com&gt;

</content>
</entry>
<entry>
<title>x86/cpu: Add a "tsx=" cmdline option with TSX disabled by default</title>
<updated>2019-10-28T07:36:58Z</updated>
<author>
<name>Pawan Gupta</name>
<email>pawan.kumar.gupta@linux.intel.com</email>
</author>
<published>2019-10-23T09:01:53Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=95c5824f75f3ba4c9e8e5a4b1a623c95390ac266'/>
<id>urn:sha1:95c5824f75f3ba4c9e8e5a4b1a623c95390ac266</id>
<content type='text'>
Add a kernel cmdline parameter "tsx" to control the Transactional
Synchronization Extensions (TSX) feature. On CPUs that support TSX
control, use "tsx=on|off" to enable or disable TSX. Not specifying this
option is equivalent to "tsx=off". This is because on certain processors
TSX may be used as a part of a speculative side channel attack.

Carve out the TSX controlling functionality into a separate compilation
unit because TSX is a CPU feature while the TSX async abort control
machinery will go to cpu/bugs.c.

 [ bp: - Massage, shorten and clear the arg buffer.
       - Clarifications of the tsx= possible options - Josh.
       - Expand on TSX_CTRL availability - Pawan. ]

Signed-off-by: Pawan Gupta &lt;pawan.kumar.gupta@linux.intel.com&gt;
Signed-off-by: Borislav Petkov &lt;bp@suse.de&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Josh Poimboeuf &lt;jpoimboe@redhat.com&gt;

</content>
</entry>
<entry>
<title>Merge tag 'for-linus-5.4-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip</title>
<updated>2019-10-12T21:11:21Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2019-10-12T21:11:21Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=680b5b3c5d34b22695357e17b6bdd0abd83e6b1c'/>
<id>urn:sha1:680b5b3c5d34b22695357e17b6bdd0abd83e6b1c</id>
<content type='text'>
Pull xen fixes from Juergen Gross:

 - correct panic handling when running as a Xen guest

 - cleanup the Xen grant driver to remove printing a pointer being
   always NULL

 - remove a soon to be wrong call of of_dma_configure()

* tag 'for-linus-5.4-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen: Stop abusing DT of_dma_configure API
  xen/grant-table: remove unnecessary printing
  x86/xen: Return from panic notifier
</content>
</entry>
<entry>
<title>mm, memcg: proportional memory.{low,min} reclaim</title>
<updated>2019-10-07T22:47:20Z</updated>
<author>
<name>Chris Down</name>
<email>chris@chrisdown.name</email>
</author>
<published>2019-10-07T00:58:32Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=9783aa9917f8ae24759e67bf882f1aba32fe4ea1'/>
<id>urn:sha1:9783aa9917f8ae24759e67bf882f1aba32fe4ea1</id>
<content type='text'>
cgroup v2 introduces two memory protection thresholds: memory.low
(best-effort) and memory.min (hard protection).  While they generally do
what they say on the tin, there is a limitation in their implementation
that makes them difficult to use effectively: that cliff behaviour often
manifests when they become eligible for reclaim.  This patch implements
more intuitive and usable behaviour, where we gradually mount more
reclaim pressure as cgroups further and further exceed their protection
thresholds.

This cliff edge behaviour happens because we only choose whether or not
to reclaim based on whether the memcg is within its protection limits
(see the use of mem_cgroup_protected in shrink_node), but we don't vary
our reclaim behaviour based on this information.  Imagine the following
timeline, with the numbers the lruvec size in this zone:

1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
   scanned. (?!)

* Of course, we won't usually scan all available pages in the zone even
  without this patch because of scan control priority, over-reclaim
  protection, etc.  However, as shown by the tests at the end, these
  techniques don't sufficiently throttle such an extreme change in input,
  so cliff-like behaviour isn't really averted by their existence alone.

Here's an example of how this plays out in practice.  At Facebook, we are
trying to protect various workloads from "system" software, like
configuration management tools, metric collectors, etc (see this[0] case
study).  In order to find a suitable memory.low value, we start by
determining the expected memory range within which the workload will be
comfortable operating.  This isn't an exact science -- memory usage deemed
"comfortable" will vary over time due to user behaviour, differences in
composition of work, etc, etc.  As such we need to ballpark memory.low,
but doing this is currently problematic:

1. If we end up setting it too low for the workload, it won't have
   *any* effect (see discussion above).  The group will receive the full
   weight of reclaim and won't have any priority while competing with the
   less important system software, as if we had no memory.low configured
   at all.

2. Because of this behaviour, we end up erring on the side of setting
   it too high, such that the comfort range is reliably covered.  However,
   protected memory is completely unavailable to the rest of the system,
   so we might cause undue memory and IO pressure there when we *know* we
   have some elasticity in the workload.

3. Even if we get the value totally right, smack in the middle of the
   comfort zone, we get extreme jumps between no pressure and full
   pressure that cause unpredictable pressure spikes in the workload due
   to the current binary reclaim behaviour.

With this patch, we can set it to our ballpark estimation without too much
worry.  Any undesirable behaviour, such as too much or too little reclaim
pressure on the workload or system will be proportional to how far our
estimation is off.  This means we can set memory.low much more
conservatively and thus waste less resources *without* the risk of the
workload falling off a cliff if we overshoot.

As a more abstract technical description, this unintuitive behaviour
results in having to give high-priority workloads a large protection
buffer on top of their expected usage to function reliably, as otherwise
we have abrupt periods of dramatically increased memory pressure which
hamper performance.  Having to set these thresholds so high wastes
resources and generally works against the principle of work conservation.
In addition, having proportional memory reclaim behaviour has other
benefits.  Most notably, before this patch it's basically mandatory to set
memory.low to a higher than desirable value because otherwise as soon as
you exceed memory.low, all protection is lost, and all pages are eligible
to scan again.  By contrast, having a gradual ramp in reclaim pressure
means that you now still get some protection when thresholds are exceeded,
which means that one can now be more comfortable setting memory.low to
lower values without worrying that all protection will be lost.  This is
important because workingset size is really hard to know exactly,
especially with variable workloads, so at least getting *some* protection
if your workingset size grows larger than you expect increases user
confidence in setting memory.low without a huge buffer on top being
needed.

Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
assistance in thinking about how to make this work better.

In testing these changes, I intended to verify that:

1. Changes in page scanning become gradual and proportional instead of
   binary.

   To test this, I experimented stepping further and further down
   memory.low protection on a workload that floats around 19G workingset
   when under memory.low protection, watching page scan rates for the
   workload cgroup:

   +------------+-----------------+--------------------+--------------+
   | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
   +------------+-----------------+--------------------+--------------+
   |        21G |               0 |                  0 | N/A          |
   |        17G |             867 |               3799 | 23%          |
   |        12G |            1203 |               3543 | 34%          |
   |         8G |            2534 |               3979 | 64%          |
   |         4G |            3980 |               4147 | 96%          |
   |          0 |            3799 |               3980 | 95%          |
   +------------+-----------------+--------------------+--------------+

   As you can see, the test kernel (with a kernel containing this
   patch) ramps up page scanning significantly more gradually than the
   control kernel (without this patch).

2. More gradual ramp up in reclaim aggression doesn't result in
   premature OOMs.

   To test this, I wrote a script that slowly increments the number of
   pages held by stress(1)'s --vm-keep mode until a production system
   entered severe overall memory contention.  This script runs in a highly
   protected slice taking up the majority of available system memory.
   Watching vmstat revealed that page scanning continued essentially
   nominally between test and control, without causing forward reclaim
   progress to become arrested.

[0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project

[akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
[chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
  Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name
Signed-off-by: Chris Down &lt;chris@chrisdown.name&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reviewed-by: Roman Gushchin &lt;guro@fb.com&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Dennis Zhou &lt;dennis@kernel.org&gt;
Cc: Tetsuo Handa &lt;penguin-kernel@i-love.sakura.ne.jp&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>x86/xen: Return from panic notifier</title>
<updated>2019-10-07T21:53:30Z</updated>
<author>
<name>Boris Ostrovsky</name>
<email>boris.ostrovsky@oracle.com</email>
</author>
<published>2019-09-30T20:44:41Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c6875f3aacf2a5a913205accddabf0bfb75cac76'/>
<id>urn:sha1:c6875f3aacf2a5a913205accddabf0bfb75cac76</id>
<content type='text'>
Currently execution of panic() continues until Xen's panic notifier
(xen_panic_event()) is called at which point we make a hypercall that
never returns.

This means that any notifier that is supposed to be called later as
well as significant part of panic() code (such as pstore writes from
kmsg_dump()) is never executed.

There is no reason for xen_panic_event() to be this last point in
execution since panic()'s emergency_restart() will call into
xen_emergency_restart() from where we can perform our hypercall.

Nevertheless, we will provide xen_legacy_crash boot option that will
preserve original behavior during crash. This option could be used,
for example, if running kernel dumper (which happens after panic
notifiers) is undesirable.

Reported-by: James Dingwall &lt;james@dingwall.me.uk&gt;
Signed-off-by: Boris Ostrovsky &lt;boris.ostrovsky@oracle.com&gt;
Reviewed-by: Juergen Gross &lt;jgross@suse.com&gt;
</content>
</entry>
</feed>
