| field | value | date |
|---|---|---|
| author | Huang Ying <ying.huang@linux.alibaba.com> | 2025-11-14 16:54:03 +0800 |
| committer | Catalin Marinas <catalin.marinas@arm.com> | 2025-11-19 16:01:48 +0000 |
| commit | cb1fa2e999558fd93b519f7c4c16e75e805af1e6 (patch) | |
| tree | 64bef432065105181fbd4149ad3531111264dabe | |
| parent | 79301c7d605a10efea35af08167e0a362d8dffb1 (diff) | |
arm64, tlbflush: don't TLBI broadcast if page reused in write fault
A multi-threaded customer workload with a large memory footprint uses
fork()/exec() to run external programs every few tens of seconds. When
the workload runs on an arm64 server, a significant share of CPU cycles
is spent in the TLB flushing functions; on an x86_64 server no such
overhead is observed. As a result, performance on arm64 is much worse
than on x86_64.
While the workload runs, fork()/exec() write-protects all pages in the
parent process, so a subsequent memory write in the parent triggers a
write-protection fault. The page fault handler then makes the PTE/PDE
writable if the page can be reused, which is almost always true in this
workload. On arm64, to avoid write-protection faults on other CPUs, the
page fault handler flushes the TLB globally with a TLBI broadcast after
changing the PTE/PDE. However, this is not always necessary. First, it
is safe to leave stale read-only TLB entries behind as long as they are
eventually flushed. Second, when the memory footprint is large, the
original read-only PTE/PDEs are quite likely not cached in any remote
TLB at all. In fact, on x86_64 the page fault handler does not flush
remote TLBs in this situation, which benefits performance considerably.
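
As an illustration of the pre-patch behaviour described above, here is
a minimal C sketch; handle_wp_fault(), flush_tlb_page_broadcast() and
page_can_be_reused() are illustrative stand-ins, not the kernel's real
API:

```c
/*
 * Illustrative sketch of the pre-patch write-protection fault path;
 * the type and helpers below are stand-ins, not the kernel's real API.
 */
typedef unsigned long pte_t;

#define PTE_WRITE	(1UL << 0)

static pte_t pte_mkwrite(pte_t pte)
{
	return pte | PTE_WRITE;
}

/* Stand-in for a broadcast TLB invalidation (TLBI ...IS on arm64). */
static void flush_tlb_page_broadcast(unsigned long addr) { (void)addr; }

/* Stand-in for the reuse check in the wp-fault path. */
static int page_can_be_reused(pte_t pte) { (void)pte; return 1; }

static void handle_wp_fault(pte_t *ptep, unsigned long addr)
{
	if (page_can_be_reused(*ptep)) {	/* almost always true here */
		*ptep = pte_mkwrite(*ptep);
		/*
		 * Pre-patch arm64: broadcast so that no CPU keeps the
		 * stale read-only entry; x86_64 skips the remote flush.
		 */
		flush_tlb_page_broadcast(addr);
	}
}
```

The broadcast on the last line is the cost the workload keeps paying,
whether or not any remote CPU ever cached the read-only entry.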
To improve performance on arm64, make the write-protection fault
handler flush the TLB locally, instead of globally via TLBI broadcast,
after making the PTE/PDE writable. If a remote CPU still holds a stale
read-only TLB entry, the page fault handler on that CPU will recognise
the resulting fault as spurious and flush the stale entry.
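
A matching sketch of the post-patch idea, again with hypothetical names
(wp_fault_reuse(), fix_spurious_wp_fault()) rather than the actual
arm64 fault-path functions:

```c
/* Same illustrative stubs as the previous sketch; names are hypothetical. */
typedef unsigned long pte_t;

#define PTE_WRITE	(1UL << 0)

/* Stand-in for a local, non-broadcast TLB invalidation. */
static void local_flush_tlb_page(unsigned long addr) { (void)addr; }

/* Faulting CPU: make the PTE writable, then flush only locally. */
static void wp_fault_reuse(pte_t *ptep, unsigned long addr)
{
	*ptep |= PTE_WRITE;
	local_flush_tlb_page(addr);	/* no TLBI broadcast */
}

/*
 * Remote CPU: a write through a stale read-only TLB entry faults.
 * If the PTE is already writable, the fault is spurious: drop the
 * stale entry locally and let the write retry.
 */
static int fix_spurious_wp_fault(pte_t *ptep, unsigned long addr)
{
	if (*ptep & PTE_WRITE) {
		local_flush_tlb_page(addr);
		return 1;	/* spurious fault, handled */
	}
	return 0;		/* genuine write-protection fault */
}
```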
To test the patchset, extend usemem.c from vm-scalability
(https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git)
to call fork()/exec() periodically. To mimic the customer workload, run
usemem with 4 threads accessing 100GB of memory and calling
fork()/exec() every 40 seconds. With the patchset applied, the usemem
score improves by ~40.6%, and the share of cycles in the TLB flush
functions drops from ~50.5% to ~0.3% in the perf profile.
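
The fork()/exec() extension could look roughly like the helper below;
fork_exec_once() and the way it hooks into usemem are assumptions for
illustration, not the actual test change:

```c
/*
 * Hypothetical sketch of the usemem.c extension: fork a child that
 * exec()s a trivial external program, so fork()'s copy-on-write setup
 * write-protects the parent's pages as in the customer workload.
 */
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void fork_exec_once(const char *prog)
{
	pid_t pid = fork();

	if (pid == 0) {
		/* Child: replace ourselves with the external program. */
		execl(prog, prog, (char *)NULL);
		_exit(EXIT_FAILURE);	/* reached only if exec fails */
	} else if (pid > 0) {
		waitpid(pid, NULL, 0);	/* reap the child */
	}
}
```

In the test above, such a helper would run every 40 seconds while the
4 worker threads keep writing to the 100GB region, so each fork()
write-protects the whole footprint again.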
Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Will Deacon <will@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
